Methods for feature selection in a learning machine

ABSTRACT

In a pre-processing step prior to training a learning machine, pre-processing includes reducing the quantity of features to be processed using feature selection methods selected from the group consisting of recursive feature elimination (RFE), minimizing the number of non-zero parameters of the system (l o -norm minimization), evaluation of cost function to identify a subset of features that are compatible with constraints imposed by the learning set, unbalanced correlation score and transductive feature selection. The features remaining after feature selection are then used to train a learning machine for purposes of pattern classification, regression, clustering and/or novelty detection. ( FIG. 3 , 300, 301, 302, 304, 306, 308, 309, 310, 311, 312, 314)

RELATED APPLICATIONS

The present application claims priority of each of the following U.S.provisional patent applications: Ser. No. 60/292,133, filed May 18,2001, Ser. No. 60/292,978, filed May 23, 2001, and Ser. No. 60/332,021,filed Nov. 21, 2001, and, for U.S. national stage purposes, is acontinuation-in-part of U.S. patent application Ser. No. 10/057,849,filed Jan. 24, 2002, now issued as U.S. Pat. No. 7,117,188, which is acontinuation-in-part of application Ser. No. 09/633,410, filed Aug. 7,2000, now issued as U.S. Pat. No. 6,882,990.

This application is related to, but does not claim priority to, thefollowing applications: application Ser. No. 09/578,011, filed May 24,2000, now issued as U.S. Pat. No. 6,658,395, which is acontinuation-in-part of application Ser. No. 09/568,301, filed May 9,2000, now issued as U.S. Pat. No. 6,427,141, which is a continuation ofapplication Ser. No. 09/303,387, filed May 1, 1999, now issued as U.S.Pat. No. 6,128,608, which claims priority to U.S. provisionalapplication Ser. No. 60/083,961, filed May 1, 1998. This application isrelated to applications Ser. No. 09/633,615, now abandoned, Ser. No.09/633,616, now issued as U.S. Pat. No. 6,760,715, Ser. No. 09/633,627,now issued as U.S. Pat. No. 6,714,925, and Ser. No. 09/633,850, nowissued as U.S. Pat. No. 6,789,069, all filed Aug. 7, 2000, which arealso continuations-in-part of application Ser. No. 09/578,011. Thisapplication is also related to applications Ser. No. 09/303,386, nowabandoned, and Ser. No. 09/305,345, now issued as Pat. No. 6,157,921,both filed May 1, 1999, and to application Ser. No. 09/715,832, filedNov. 14, 2000, now abandoned, all of which also claim priority toprovisional application Ser. No. 60/083,961. Each of theabove-identified applications is incorporated herein by reference.

FIELD OF THE INVENTION

The present invention relates to the use of learning machines toidentify relevant patterns in datasets containing large quantities ofdiverse data, and more particularly to a method and system for selectionof features within the data sets which best enable identification ofrelevant patterns.

BACKGROUND OF THE INVENTION

Knowledge discovery is the most desirable end product of datacollection. Recent advancements in database technology have lead to anexplosive growth in systems and methods for generating, collecting andstoring vast amounts of data. While database technology enablesefficient collection and storage of large data sets, the challenge offacilitating human comprehension of the information in this data isgrowing ever more difficult. With many existing techniques the problemhas become unapproachable. Thus, there remains a need for a newgeneration of automated knowledge discovery tools.

As a specific example, the Human Genome Project has completed sequencingof the human genome. The complete sequence contains a staggering amountof data, with approximately 31,500 genes in the whole genome. The amountof data relevant to the genome must then be multiplied when consideringcomparative and other analyses that are needed in order to make use ofthe sequence data. To illustrate, human chromosome 20 alone comprisesnearly 60 million base pairs. Several disease-causing genes have beenmapped to chromosome 20 including various autoimmune diseases, certainneurological diseases, type 2 diabetes, several forms of cancer, andmore, such that considerable information can be associated with thissequence alone.

One of the more recent advances in determining the functioningparameters of biological systems is the analysis of correlation ofgenomic information with protein functioning to elucidate therelationship between gene expression, protein function and interaction,and disease states or progression. Proteomics is the study of the groupof proteins encoded and regulated by a genome. Genomic activation orexpression does not always mean direct changes in protein productionlevels or activity. Alternative processing of MRNA orpost-transcriptional or post-translational regulatory mechanisms maycause the activity of one gene to result in multiple proteins, all ofwhich are slightly different with different migration patterns andbiological activities. The human proteome is believed to be 50 to 100times larger than the human genome. Currently, there are no methods,systems or devices for adequately analyzing the data generated by suchbiological investigations into the genome and proteome.

In recent years, machine-learning approaches for data analysis have beenwidely explored for recognizing patterns which, in turn, allowextraction of significant information contained within a large data setthat may also include data consisting of nothing more than irrelevantdetail. Learning machines comprise algorithms that may be trained togeneralize using data with known outcomes. Trained learning machinealgorithms may then be applied to predict the outcome in cases ofunknown outcome, i.e., to classify the data according to learnedpatterns. Machine-learning approaches, which include neural networks,hidden Markov models, belief networks and kernel-based classifiers suchas support vector machines, are ideally suited for domains characterizedby the existence of large amounts of data, noisy patterns and theabsence of general theories. Support vector machines are disclosed inU.S. Pat. Nos. 6,128,608 and 6,157,921, both of which are assigned tothe assignee of the present application and are incorporated herein byreference.

The quantities introduced to describe the data that is input into alearning machine are typically referred to as “features”, while theoriginal quantities are sometimes referred to as “attributes”. A commonproblem in classification, and machine learning in general, is thereduction of dimensionality of feature space to overcome the risk of“overfitting”. Data overfitting arises when the number n of features islarge, such as the thousands of genes studied in a microarray, and thenumber of training patterns is comparatively small, such as a few dozenpatients. In such situations, one can find a decision function thatseparates the training data, even a linear decision function, but itwill perform poorly on test data. The task of choosing the most suitablerepresentation is known as “feature selection”.

A number of different approaches to feature selection exists, where oneseeks to identify the smallest set of features that still conveys theessential information contained in the original attributes. This isknown as “dimensionality reduction” and can be very beneficial as bothcomputational and generalization performance can degrade as the numberof features grows, a phenomenon sometimes referred to as the “curse ofdimensionality.”

Training techniques that use regularization, i.e., restricting the classof admissible solutions, can avoid overfitting the data withoutrequiring space dimensionality reduction. Support Vector Machines (SVMs)use regularization, however even SVMs can benefit from spacedimensionality (feature) reduction.

The problem of feature selection is well known in pattern recognition.In many supervised learning problems, feature selection can be importantfor a variety of reasons including generalization performance, runningtime requirements and constraints and interpretational issues imposed bythe problem itself. Given a particular classification technique, one canselect the best subset of features satisfying a given “model selection”criterion by exhaustive enumeration of all subsets of features. However,this method is impractical for large numbers of features, such asthousands of genes, because of the combinatorial explosion of the numberof subsets.

One method of feature reduction is projecting on the first few principaldirections of the data. Using this method, new features are obtainedthat are linear combinations of the original features. One disadvantageof projection methods is that none of the original input features can bediscarded. Preferred methods incorporate pruning techniques thateliminate some of the original input features while retaining a minimumsubset of features that yield better classification performance. Fordesign of diagnostic tests, it is of practical importance to be able toselect a small subset of genes for cost effectiveness and to permit therelevance of the genes selected to be verified more easily.

Accordingly, the need remains for a method for selection of the featuresto be used by a learning machine for pattern recognition which stillminimizes classification error.

SUMMARY OF THE INVENTION

In an exemplary embodiment, the present invention comprisespreprocessing a training data set in order to allow the mostadvantageous application of the learning machine. Each training datapoint comprises a vector having one or more coordinates. Pre-processingthe training data set may comprise identifying missing or erroneous datapoints and taking appropriate steps to correct the flawed data or asappropriate remove the observation or the entire field from the scope ofthe problem. In a preferred embodiment, pre-processing includes reducingthe quantity of features to be processed using feature selection methodsselected from the group consisting of recursive feature elimination(RFE), minimizing the number of non-zero parameters of the system(l₀-norm minimization), evaluation of cost function to identify a subsetof features that are compatible with constraints imposed by the learningset, unbalanced correlation score and transductive feature selection.The features remaining after feature selection are then used to train alearning machine for purposes of pattern classification, regression,clustering and/or novelty detection. In a preferred embodiment, thelearning machine is a kernel-based classifier. In the most preferredembodiment, the learning machine comprises a plurality of support vectormachines.

A test data set is pre-processed in the same manner as was the trainingdata set. Then, the trained learning machine is tested using thepre-processed test data set. A test output of the trained learningmachine may be post-processing to determine if the test output is anoptimal solution based on known outcome of the test data set.

In the context of a a kernel-based learning machine such as a supportvector machine, the present invention also provides for the selection ofat least one kernel prior to training the support vector machine. Theselection of a kernel may be based on prior knowledge of the specificproblem being addressed or analysis of the properties of any availabledata to be used with the learning machine and is typically dependant onthe nature of the knowledge to be discovered from the data.

Kernels are usually defined for patterns that can be represented as avector of real numbers. For example, linear Kernels, radial basisfunction kernels and polynomial kernels all measure the similarity of apair of real vectors. Such kernels are appropriate when the patterns arebest represented as a sequence of real numbers.

An iterative process comparing postprocessed training outputs or testoutputs can be applied to make a determination as to which kernelconfiguration provides the optimal solution. If the test output is notthe optimal solution, the selection of the kernel may be adjusted andthe support vector machine may be retrained and retested. Once it isdetermined that the optimal solution has been identified, a live dataset may be collected and pre-processed in the same manner as was thetraining data set to select the features that best represent the data.The pre-processed live data set is input into the learning machine forprocessing. The live output of the learning machine may then bepost-processed by interpreting the live output into a computationallyderived alphanumeric classifier or other form suitable to furtherutilization of the SVM derived answer.

In an exemplary embodiment a system is provided enhancing knowledgediscovered from data using a support vector machine. The exemplarysystem comprises a storage device for storing a training data set and atest data set, and a processor for executing a support vector machine.The processor is also operable for collecting the training data set fromthe database, pre-processing the training data set, training the supportvector machine using the pre-processed training data set, collecting thetest data set from the database, pre-processing the test data set in thesame manner as was the training data set, testing the trained supportvector machine using the pre-processed test data set, and in response toreceiving the test output of the trained support vector machine,post-processing the test output to determine if the test output is anoptimal solution. The exemplary system may also comprise acommunications device for receiving the test data set and the trainingdata set from a remote source. In such a case, the processor may beoperable to store the training data set in the storage device priorpre-processing of the training data set and to store the test data setin the storage device prior pre-processing of the test data set. Theexemplary system may also comprise a display device for displaying thepost-processed test data. The processor of the exemplary system mayfurther be operable for performing each additional function describedabove. The communications device may be further operable to send acomputationally derived alphanumeric classifier or other SVM-based rawor post-processed output data to a remote source.

In an exemplary embodiment, a system and method are provided forenhancing knowledge discovery from data using multiple learning machinesin general and multiple support vector machines in particular. Trainingdata for a learning machine is pre-processed. Multiple support vectormachines, each comprising distinct kernels, are trained with thepre-processed training data and are tested with test data that ispre-processed in the same manner. The test outputs from multiple supportvector machines are compared in order to determine which of the testoutputs if any represents an optimal solution. Selection of one or morekernels may be adjusted and one or more support vector machines may beretrained and retested. When it is determined that an optimal solutionhas been achieved, live data is pre-processed and input into the supportvector machine comprising the kernel that produced the optimal solution.The live output from the learning machine may then be post-processed asneeded to place the output in a format appropriate for interpretation bya human or another computer.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flowchart illustrating an exemplary general method forincreasing knowledge that may be discovered from data using a learningmachine.

FIG. 2 is a flowchart illustrating an exemplary method for increasingknowledge that may be discovered from data using a support vectormachine.

FIG. 3 is a flowchart illustrating an exemplary optimal categorizationmethod that may be used in a stand-alone configuration or in conjunctionwith a learning machine for pre-processing or post-processing techniquesin accordance with an exemplary embodiment of the present invention.

FIG. 4 is a functional block diagram illustrating an exemplary operatingenvironment for an embodiment of the present invention.

FIG. 5 is a functional block diagram illustrating a hierarchical systemof multiple support vector machines.

FIG. 6 shows graphs of the results of using RFE.

FIG. 7 shows the results of RFE after preprocessing.

FIG. 8 shows the results of RFE when training on 100 dense QT_clustclusters.

FIG. 9 shows a comparison of feature (gene) selection methods for coloncancer data.

FIG. 10 shows the selection of an optimum number of genes for coloncancer data.

FIGS. 11 a-f are plots comparing results with SVM and l₀-SVMapproximation for learning sparse and non-sparse target functions withpolynomials of degree 2 over 5 inputs, where FIGS. 11 a and 11 b showthe results of SVM target functions ƒ(x)=x₁x₂+x₃ and ƒ(x)=x₁,respectively; and FIGS. 11 c-f provide the results for 95%, 90%, 80% and0% sparse poly degree 2, respectively.

FIGS. 12 a-f are plots comparing results with SVM and l₂-AL0M SVMapproximation for learning sparse and non-sparse target functions withpolynomials of degree 2 over 10 inputs, where FIGS. 12 a and 12 b showthe results of SVM target functions ƒ(x)=x₁x₂+x₃ and ƒ(x)=x₁,respectively; and FIGS. 12 c-f provide the results for 95%, 90%, 80% and0% sparse poly degree 2, respectively.

FIGS. 13 a-b are plots of the results of a sparse SVM and a classicalSVM, respectively, using a RBF kernel with a value of σ=0.1.

FIGS. 14 a-d are plots of the results of a sparse SVM (a,c) and aclassical SVM (b,d).

FIGS. 15 a-d are plots of the results of performingguaranteed-distortion mk-vector quantization on two dimensional uniformdata with varying distortion levels, where the quantizing vectors arem=37, 11, 6 and 4,respectively.

FIG. 16 is a plot of showing the Hamming distance for a dataset havingfive labels and two possible label sets.

FIG. 17 illustrates a simple toy problem with three labels, one of whichis associated with all inputs.

FIG. 18. is a histogram of error in the binary approach in the toyproblem of FIG. 17.

FIGS. 19 a-c are histograms of the leave-one-out estimate of the HammingLoss for the Prostate Cancer Database relating to the embodiment offeature selection in multi-label cases for 7 labels where FIG. 19 a is ahistogram of the errors for a direct approach; FIG. 19 b is a histogramof the errors for the binary approach; and FIG. 19 c is a histogram ofthe errors for the binary approach when the system is forced to outputat least one label.

FIGS. 20 a-c are histograms of the leave-one-out estimate of the HammingLoss for the Prostate Cancer Database relating to the embodiment offeature selection in multi-label cases for 4 labels where FIG. 20 a is ahistogram of the errors for the direct approach; FIG. 20 b is ahistogram of the errors for the binary approach; and FIG. 20 c is ahistogram of the errors for the binary approach when the system isforced to output at least one label.

FIG. 21 shows the distribution of the mistakes using the leave-one-outestimate of the Hamming Loss for the Prostate Cancer Database using 4labels with feature selection. FIG. 21 a is a histogram of the errorsfor the direct approach where the value of the one bar is 11. FIG. 21 bis a histogram of the errors for the binary approach, where the valuesof the bars are from left to right: 12 and 7.

FIGS. 22 a and b are plots showing the-results of the transductive SVMusing the CORR_(ub) feature selection method: FIG. 22 a provides theresults for 4 to 20 features compared with the inductive SVM method;FIG. 22 b. provides the results for the transductive method using 4 to100 features.

FIGS. 23 a and b are plots showing the results of the transductiveCORR_(ub) ² method compared to the inductive version of the CORR_(ub) ²method, with FIG. 23 a showing the results for 4 to 25 features and FIG.23 b the results for 4 to 100 features.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

The present invention provides methods, systems and devices fordiscovering knowledge from data using leaming machines. Particularly,the present invention is directed to methods, systems and devices forknowledge discovery from data using learning machines that are providedinformation regarding changes in biological systems. More particularly,the present invention comprises methods of use of such knowledge fordiagnosing and prognosing changes in biological systems such asdiseases. Additionally, the present invention comprises methods,compositions and devices for applying such knowledge to the testing andtreating of individuals with changes in their individual biologicalsystems. Preferred embodiments comprise detection of genes involved withprostate cancer and use of such information for treatment of patientswith prostate cancer.

As used herein, “biological data” means any data derived from measuringbiological conditions of human, animals or other biological organismsincluding microorganisms, viruses, plants and other living organisms.The measurements may be made by any tests, assays or observations thatare known to physicians, scientists, diagnosticians, or the like.Biological data may include, but is not limited to, clinical tests andobservations, physical and chemical measurements, genomicdeterminations, proteomic determinations, drug levels, hormonal andimmunological tests, neurochemical or neurophysical measurements,mineral and vitamin level determinations, genetic and familialhistories, and other determinations that may give insight into the stateof the individual or individuals that are undergoing testing. Herein,the use of the term “data” is used interchangeably with “biologicaldata”.

While several examples of learning machines exist and advancements areexpected in this field, the exemplary embodiments of the presentinvention focus on kernel-based learning machines and more particularlyon the support vector machine.

The present invention can be used to analyze biological data generatedat multiple stages of investigation into biological functions, andfurther, to integrate the different kinds of data for novel diagnosticand prognostic deterninations. For example, biological data obtainedfrom clinical case information, such as diagnostic test data, family orgenetic histories, prior or current medical treatments, and the clinicaloutcomes of such activities, can be utilized in the methods, systems anddevices of the present invention. Additionally, clinical samples such asdiseased tissues or fluids, and normal tissues and fluids, and cellseparations can provide biological data that can be utilized by thecurrent invention. Proteomic determinations such as 2-D gel, massspectrophotometry and antibody screening can be used to establishdatabases that can be utilized by the present invention. Genomicdatabases can also be used alone or in combination with theabove-described data and databases by the present invention to providecomprehensive diagnosis, prognosis or predictive capabilities to theuser of the present invention.

A first aspect of the present invention facilitates analysis of data bypre-processing the data prior to using the data to train a learningmachine and/or optionally post-processing the output from a learningmachine. Generally stated, pre-processing data comprises reformatting oraugmenting the data in order to allow the learning machine to be appliedmost advantageously. More specifically, pre-processing involvesselecting a method for reducing the dimensionality of the feature space,i.e., selecting the features which best represent the data. Methodswhich may be used for this purpose include recursive feature elimination(RFE), minimizing the number of non-zero parameters of the system,evaluation of cost function to identify a subset of features that arecompatible with constraints imposed by the learning set, unbalancedcorrelation score, inductive feature selection and transductive featureselection. The features remaining after feature selection are then usedto train a learning machine for purposes of pattern classification,regression, clustering and/or novelty detection.

In a manner similar to pre-processing, post-processing involvesinterpreting the output of a learning machine in order to discovermeaningful characteristics thereof. The meaningful characteristics to beascertained from the output may be problem- or data-specific.Post-processing involves interpreting the output into a form that, forexample, may be understood by or is otherwise useful to a humanobserver, or converting the output into a form which may be readilyreceived by another device for, e.g., archival or transmission.

FIG. 1 is a flowchart illustrating a general method 100 for analyzingdata using learning machines. The method 100 begins at starting block101 and progresses to step 102 where a specific problem is formalizedfor application of analysis through machine learning. Particularlyimportant is a proper formulation of the desired output of the learningmachine. For instance, in predicting future performance of an individualequity instrument, or a market index, a learning machine is likely toachieve better performance when predicting the expected future changerather than predicting the future price level. The future priceexpectation can later be derived in a post-processing step as will bediscussed later in this specification.

After problem formalization, step 103 addresses training datacollection. Training data comprises a set of data points having knowncharacteristics. This data may come from customers, research facilities,academic institutions, national laboratories, commercial entities orother public or confidential sources. The source of the data and thetypes of data provided are not crucial to the methods. Training data maybe collected from one or more local and/or remote sources. The data maybe provided through any means such as via the internet, server linkagesor discs, CD/ROMs, DVDs or other storage means. The collection oftraining data may be accomplished manually or by way of an automatedprocess, such as known electronic data transfer methods. Accordingly, anexemplary embodiment of the learning machine for use in conjunction withthe present invention may be implemented in a networked computerenvironment. Exemplary operating environments for implementing variousembodiments of the learning machine will be described in detail withrespect to FIGS. 4-5.

At step 104, the collected training data is optionally pre-processed inorder to allow the learning machine to be applied most advantageouslytoward extraction of the knowledge inherent to the training data Duringthis preprocessing stage a variety of different transformations can beperformed on the data to enhance its usefulness. Such transformations,examples of which include addition of expert information, labeling,binary conversion, Fourier transformations, etc., will be readilyapparent to those of skill in the art. However, the preprocessing ofinterest in the present invention is the reduction of dimensionality byway of feature selection, different methods of which are described indetail below.

Returning to FIG. 1, an exemplary method 100 continues at step 106,where the learning machine is trained using the pre-processed data. Asis known in the art, a learning machine is trained by adjusting itsoperating parameters until a desirable training output is achieved. Thedetermination of whether a training output is desirable may beaccomplished either manually or automatically by comparing the trainingoutput to the known characteristics of the training data. A learningmachine is considered to be trained when its training output is within apredetermined error threshold from the known characteristics of thetraining data. In certain situations, it may be desirable, if notnecessary, to post-process the training output of the learning machineat step 107. As mentioned, post-processing the output of a learningmachine involves interpreting the output into a meaningful form. In thecontext of a regression problem, for example, it may be necessary todetermine range categorizations for the output of a learning machine inorder to determine if the input data points were correctly categorized.In the example of a pattern recognition problem, it is often notnecessary to post-process the training output of a learning machine.

At step 108, test data is optionally collected in preparation fortesting the trained learning machine. Test data may be collected fromone or more local and/or remote sources. In practice, test data andtraining data may be collected from the same source(s) at the same time.Thus, test data and training data sets can be divided out of a commondata set and stored in a local storage medium for use as different inputdata sets for a learning machine. Regardless of how the test data iscollected, any test data used must be pre-processed at step 110 in thesame manner as was the training data. As should be apparent to thoseskilled in the art, a proper test of the learning may only beaccomplished by using testing data of the same format as the trainingdata. Then, at step 112 the learning machine is tested using thepre-processed test data, if any. The test output of the learning machineis optionally post-processed at step 114 in order to determine if theresults are desirable. Again, the post processing step involvesinterpreting the test output into a meaningful form. The meaningful formmay be one that is readily understood by a human or one that iscompatible with another processor. Regardless, the test output must bepost-processed into a form which may be compared to the test data todetermine whether the results were desirable. Examples ofpost-processing steps include but are not limited of the following:optimal categorization determinations, scaling techniques (linear andnon-linear), transformations (linear and non-linear), and probabilityestimations. The method 100 ends at step 116.

FIG. 2 is a flow chart illustrating an exemplary method 200 forenhancing knowledge that may be discovered from data using a specifictype of learning machine known as a support vector machine (SVM). A SVMimplements a specialized algorithm for providing generalization whenestimating a multi-dimensional function from a limited collection ofdata. A SVM may be particularly useful in solving dependency estimationproblems. More specifically, a SVM may be used accurately in estimatingindicator functions (e.g. pattern recognition problems) and real-valuedfunctions (e.g. function approximation problems, regression estimationproblems, density estimation problems, and solving inverse problems).The SVM was originally developed by Boser, Guyon and Vapnik (“A trainingalogorithm for optimal margin classifiers”, Fifth annual Workshop onComputational Learning Theory, Pittsburch, ACM (1992) pp. 142-152). Theconcepts underlying the SVM are also explained in detail in Vapnik'sbook, entitled Statistical Leaning Theory (John Wiley & Sons, Inc.1998), which is herein incorporated by reference in its entirety.Accordingly, a familiarity with SVMs and the terminology used therewithare presumed throughout this specification.

The exemplary method 200 begins at starting block 201 and advances tostep 202, where a problem is formulated and then to step 203, where atraining data set is collected. As was described with reference to FIG.1, training data may be collected from one or more local and/or remotesources, through a manual or automated process. At step 204 the trainingdata is optionally pre-processed. Those skilled in the art shouldappreciate that SVMs are capable of processing input data havingextremely large dimensionality, however, according to the presentinvention, pre-processing includes the use of feature selection methodsto reduce the dimensionality of feature space.

At step 206 a kernel is selected for the SVM. As is known in the art,different kernels will cause a SVM to produce varying degrees of qualityin the output for a given set of input data. Therefore, the selection ofan appropriate kernel may be essential to the desired quality of theoutput of the SVM. In one embodiment of the learning machine, a kernelmay be chosen based on prior performance knowledge. As is known in theart, exemplary kernels include polynomial kernels, radial basisclassifier kernels, linear kernels, etc. In an alternate embodiment, acustomized kernel may be created that is specific to a particularproblem or type of data set. In yet another embodiment, the multipleSVMs may be trained and tested simultaneously, each using a differentkernel. The quality of the outputs for each simultaneously trained andtested SVM may be compared using a variety of selectable or weightedmetrics (see step 222) to determine the most desirable kernel. In stillanother embodiment which is particularly advantageous for use withstructured data, locational kernels are defined to exploit patternswithin the structure. The locational kernels are then used to constructkernels on the structured object.

Next, at step 208 the pre-processed training data is input into the SVM.At step 210, the SVM is trained using the pre-processed training data togenerate an optimal hyperplane. Optionally, the training output of theSVM may then be post-processed at step 211. Again, post-processing oftraining output may be desirable, or even necessary, at this point inorder to properly calculate ranges or categories for the output. At step212 test data is collected similarly to previous descriptions of datacollection. The test data is pre-processed at step 214 in the samemanner as was the training data above. Then, at step 216 thepre-processed test data is input into the SVM for processing in order todetermine whether the SVM was trained in a desirable manner. The testoutput is received from the SVM at step 218 and is optionallypost-processed at step 220.

Based on the post-processed test output, it is determined at step 222whether an optimal minimum was achieved by the SVM. Those skilled in theart should appreciate that a SVM is operable to ascertain an outputhaving a global minimum error. However, as mentioned above, outputresults of a SVM for a given data set will typically vary with kernelselection. Therefore, there are in fact multiple global minimums thatmay be ascertained by a SVM for a given set of data. As used herein, theterm “optimal minimum” or “optimal solution” refers to a selected globalminimum that is considered to be optimal (e.g. the optimal solution fora given set of problem specific, pre-established criteria) when comparedto other global minimums ascertained by a SVM. Accordingly, at step 222,determining whether the optimal minimum has been ascertained may involvecomparing the output of a SVM with a historical or predetermined value.Such a predetermined value may be dependant on the test data set. Forexample, in the context of a pattern recognition problem where datapoints are classified by a SVM as either having a certain characteristicor not having the characteristic, a global minimum error of 50% wouldnot be optimal. In this example, a global minimum of 50% is no betterthan the result that would be achieved by flipping a coin to determinewhether the data point had that characteristic. As another example, inthe case where multiple SVMs are trained and tested simultaneously withvarying kernels, the outputs for each SVM may be compared with output ofother SVM to determine the practical optimal solution for thatparticular set of kernels. The determination of whether an optimalsolution has been ascertained may be performed manually or through anautomated comparison process.

If it is determined that the optimal minimum has not been achieved bythe trained SVM, the method advances to step 224, where the kernelselection is adjusted. Adjustment of the kernel selection may compriseselecting one or more new kernels or adjusting kernel parameters.Furthermore, in the case where multiple SVMs were trained and testedsimultaneously, selected kernels may be replaced or modified while otherkernels may be re-used for control purposes. After the kernel selectionis adjusted, the method 200 is repeated from step 208, where thepre-processed training data is input into the SVM for training purposes.When it is determined at step 222 that the optimal minimum has beenachieved, the method advances to step 226, where live data is collectedsimilarly as described above. By definition, live data has not beenpreviously evaluated, so that the desired output characteristics thatwere known with respect to the training data and the test data are notknown.

At step 228 the live data is pre-processed in the same manner as was thetraining data and the test data. At step 230, the live pre-processeddata is input into the SVM for processing. The live output of the SVM isreceived at step 232 and is post-processed at step 234. The method 200ends at step 236.

FIG. 3 is a flow chart illustrating an exemplary optimal categorizationmethod 300 that may be used for pre-processing data or post-processingoutput from a learning machine. Additionally, as will be describedbelow, the exemplary optimal categorization method may be used as astand-alone categorization technique, independent from learningmachines. The exemplary optimal categorization method 300 begins atstarting block 301 and progresses to step 302, where an input data setis received. The input data set comprises a sequence of data samplesfrom a continuous variable. The data samples fall within two or moreclassification categories. Next, at step 304 the bin and class-trackingvariables are initialized. As is known in the art, bin variables relateto resolution, while class-tracking variables relate to the number ofclassifications within the data set. Determining the values forinitialization of the bin and class-tracking variables may be performedmanually or through an automated process, such as a computer program foranalyzing the input data set. At step 306, the data entropy for each binis calculated. Entropy is a mathematical quantity that measures theuncertainty of a random distribution. In the exemplary method 300,entropy is used to gauge the gradations of the input variable so thatmaximum classification capability is achieved.

The method 300 produces a series of “cuts” on the continuous variable,such that the continuous variable may be divided into discretecategories. The cuts selected by the exemplary method 300 are optimal inthe sense that the average entropy of each resulting discrete categoryis minimized. At step 308, a determination is made as to whether allcuts have been placed within input data set comprising the continuousvariable. If all cuts have not been placed, sequential bin combinationsare tested for cutoff determination at step 310. From step 310, theexemplary method 300 loops back through step 306 and returns to step 308where it is again determined whether all cuts have been placed withininput data set comprising the continuous variable. When all cuts havebeen placed, the entropy for the entire system is evaluated at step 309and compared to previous results from testing more or fewer cuts. If itcannot be concluded that a minimum entropy state has been determined,then other possible cut selections must be evaluated and the methodproceeds to step 311. From step 311 a heretofore untested selection fornumber of cuts is chosen and the above process is repeated from step304. When either the limits of the resolution determined by thebin-width has been tested or the convergence to a minimum solution hasbeen identified, the optimal classification criteria is output at step312 and the exemplary optimal categorization method 300 ends at step314.

The optimal categorization method 300 takes advantage of dynamicprogramming techniques. As is known in the art, dynamic programmingtechniques may be used to significantly improve the efficiency ofsolving certain complex problems through carefully structuring analgorithm to reduce redundant calculations. In the optimalcategorization problem, the straightforward approach of exhaustivelysearching through all possible cuts in the continuous variable datawould result in an algorithm of exponential complexity and would renderthe problem intractable for even moderate sized inputs. By takingadvantage of the additive property of the target function, in thisproblem the average entropy, the problem may be divided into a series ofsub-problems. By properly formulating algorithmic sub-structures forsolving each sub-problem and storing the solutions of the sub-problems,a significant amount of redundant computation may be identified andavoided. As a result of using the dynamic programming approach, theexemplary optimal categorization method 300 may be implemented as analgorithm having a polynomial complexity, which may be used to solvelarge sized problems.

As mentioned above, the exemplary optimal categorization method 300 maybe used in pre-processing data and/or post-processing the output of alearning machine. For example, as a pre-processing transformation step,the exemplary optimal categorization method 300 may be used to extractclassification information from raw data. As a post-processingtechnique, the exemplary optimal range categorization method may be usedto determine the optimal cut-off values for markers objectively based ondata, rather than relying on ad hoc approaches. As should be apparent,the exemplary optimal categorization method 300 has applications inpattern recognition, classification, regression problems, etc. Theexemplary optimal categorization method 300 may also be used as astand-alone categorization technique, independent from SVMs and otherlearning machines.

FIG. 4 and the following discussion are intended to provide a brief andgeneral description of a suitable computing environment for implementingbiological data analysis according to the present invention. Althoughthe system shown in FIG. 4 is a conventional personal computer 1000,those skilled in the art will recognize that the invention also may beimplemented using other types of computer system configurations. Thecomputer 1000 includes a central processing unit 1022, a system memory1020, and an Input/Output (“I/O”) bus 1026. A system bus 1021 couplesthe central processing unit 1022 to the system memory 1020. A buscontroller 1023 controls the flow of data on the I/O bus 1026 andbetween the central processing unit 1022 and a variety of internal andexternal I/O devices. The I/O devices connected to the I/O bus 1026 mayhave direct access to the system memory 1020 using a Direct MemoryAccess (“DMA”) controller 1024.

The I/O devices are connected to the I/O bus 1026 via a set of deviceinterfaces. The device interfaces may include both hardware componentsand software components. For instance, a hard disk drive 1030 and afloppy disk drive 1032 for reading or writing removable media 1050 maybe connected to the I/O bus 1026 through disk drive controllers 1040. Anoptical disk drive 1034 for reading or writing optical media 1052 may beconnected to the I/O bus 1026 using a Small Computer System Interface(“SCSI”) 1041. Alternatively, an IDE (Integrated Drive Electronics,i.e., a hard disk drive interface for PCs), ATAPI (ATtAchment PacketInterface, i.e., CD-ROM and tape drive interface), or EIDE (EnhancedIDE) interface may be associated with an optical drive such as may bethe case with a CD-ROM drive. The drives and their associatedcomputer-readable media provide nonvolatile storage for the computer1000. In addition to the computer-readable media described above, othertypes of computer-readable media may also be used, such as ZIP drives,or the like.

A display device 1053, such as a monitor, is connected to the I/O bus1026 via another interface, such as a video adapter 1042. A parallelinterface 1043 connects synchronous peripheral devices, such as a laserprinter 1056, to the I/O bus 1026. A serial interface 1044 connectscommunication devices to the I/O bus 1026. A user may enter commands andinformation into the computer 1000 via the serial interface 1044 or byusing an input device, such as a keyboard 1038, a mouse 1036 or a modem1057. Other peripheral devices (not shown) may also be connected to thecomputer 1000, such as audio input/output devices or image capturedevices.

A number of program modules may be stored on the drives and in thesystem memory 1020. The system memory 1020 can include both RandomAccess Memory (“RAM”) and Read Only Memory (“ROM”). The program modulescontrol how the computer 1000 functions and interacts with the user,with I/O devices or with other computers. Program modules includeroutines, operating systems 1065, application programs, data structures,and other software or firmware components. In an illustrativeembodiment, the learning machine may comprise one or more pre-processingprogram modules 1075A, one or more post-processing program modules1075B, and/or one or more optimal categorization program modules 1077and one or more SVM program modules 1070 stored on the drives or in thesystem memory 1020 of the computer 1000. Specifically, pre-processingprogram modules 1075A, post-processing program modules 1075B, togetherwith the SVM program modules 1070 may comprise computer-executableinstructions for pre-processing data and post-processing output from alearning machine and implementing the learning algorithm according tothe exemplary methods described with reference to FIGS. 1 and 2.Furthermore, optimal categorization program modules 1077 may comprisecomputer-executable instructions for optimally categorizing a data setaccording to the exemplary methods described with reference to FIG. 3.

The computer 1000 may operate in a networked environment using logicalconnections to one or more remote computers, such as remote computer1060. The remote computer 1060 may be a server, a router, a peer deviceor other common network node, and typically includes many or all of theelements described in connection with the computer 1000. In a networkedenvironment, program modules and data may be stored on the remotecomputer 1060. The logical connections depicted in FIG. 4 include alocal area network (“LAN”) 1054 and a wide area network (“WAN”) 1055. Ina LAN environment, a network interface 1045, such as an Ethernet adaptercard, can be used to connect the computer 1000 to the remote computer1060. In a WAN environment, the computer 1000 may use atelecommunications device, such as a modem 1057, to establish aconnection. It will be appreciated that the network connections shownare illustrative and other devices of establishing a communications linkbetween the computers may be used.

In another embodiment, a plurality of SVMs can be configured tohierarchically process multiple data sets in parallel or sequentially.In particular, one or more first-level SVMs may be trained and tested toprocess a first type of data and one or more first-level SVMs can betrained and tested to process a second type of data. Additional types ofdata may be processed by other first-level SVMs. The output from some orall of the first-level SVMs may be combined in a logical manner toproduce an input data set for one or more second-level SVMs. In asimilar fashion, output from a plurality of second-level SVMs may becombined in a logical manner to produce input data for one or morethird-level SVM. The hierarchy of SVMs may be expanded to any number oflevels as may be appropriate. In this manner, lower hierarchical levelSVMs may be used to pre-process data that is to be input into higherlevel SVMs. Also, higher hierarchical level SVMs may be used topost-process data that is output from lower hierarchical level SVMs.

Each SVM in the hierarchy or each hierarchical level of SVMs may beconfigured with a distinct kernel. For example, SVMs used to process afirst type of data may be configured with a first type of kernel whileSVMs used to process a second type of data may utilize a second,different type of kernel. In addition, multiple SVMs in the same ordifferent hierarchical level may be configured to process the same typeof data using distinct kernels.

FIG. 5 illustrates an exemplary hierarchical system of SVMs. As shown,one or more first-level SVMs 1302 a and 1302 b may be trained and testedto process a first type of input data 1304 a, such as mammography data,pertaining to a sample of medical patients. One or more of these SVMsmay comprise a distinct kernel, indicated as “KERNEL 1” and “KERNEL 2”.Also, one or more additional first-level SVMs 1302 c and 1302 d may betrained and tested to process a second type of data 1304 b, which maybe, for example, genomic data for the same or a different sample ofmedical patients. Again, one or more of the additional SVMs may comprisea distinct kernel, indicated as “KERNEL 1” and “KERNEL 3”. The outputfrom each of the like first-level SVMs may be compared with each other,e.g., 1306 a compared with 1306 b; 1306 c compared with 1306 d, in orderto determine optimal outputs 1308 a and 1308 b. Then, the optimaloutputs from the two groups or first-level SVMs, i.e., outputs 1308 aand 1308 b, may be combined to form a new multi-dimensional input dataset 1310, for example, relating to mammography and genomic data. The newdata set may then be processed by one or more appropriately trained andtested second-level SVMs 1312 a and 1312 b. The resulting outputs 1314 aand 1314 b from second-level SVMs 1312 a and 1312 b may be compared todetermine an optimal output 1316. Optimal output 1316 may identifycausal relationships between the mammography and genomic data points. Asshould be apparent to those of skill in the art, other combinations ofhierarchical SVMs may be used to process either in parallel or serially,data of different types in any field or industry in which analysis ofdata is desired.

Feature Selection:

Feature Selection by Recursive Feature Elimination.

The problem of selection of a small amount of data from a large datasource, such as a gene subset from a microarray, is particularly solvedusing the methods, devices and systems described herein. Previousattempts to address this problem used correlation techniques, i.e.,assigning a coefficient to the strength of association betweenvariables. In a first embodiment described herein, support vectormachines methods based on recursive feature elimination (RFE) are used.In examining genetic data to find determinative genes, these methodseliminate gene redundancy automatically and yield better and morecompact gene subsets. The methods, devices and systems described hereincan be used with publicly-available data to find relevant answers, suchas genes determinative of a cancer diagnosis, or with specificallygenerated data.

The illustrative examples are directed at gene expression datamanipulations, however, any data can be used in the methods, systems anddevices described herein. There are studies of gene clusters discoveredby unsupervised or supervised learning techniques. Preferred methodscomprise application of SVMs in determining a small subset of highlydiscriminant genes that can be used to build very reliable cancerclassifiers. Identification of discriminant genes is beneficial inconfirming recent discoveries in research or in suggesting avenues forresearch or treatment. Diagnostic tests that measure the abundance of agiven protein in bodily fluids may be derived from the discovery of asmall subset of discriminant genes.

In classification methods using SVMs, the input is a vector referred toas a “pattern” of n components referred to as “features”. F is definedas the n-dimensional feature space. In the examples given, the featuresare gene expression coefficients and the patterns correspond topatients. While the present discussion is directed to two-classclassification problems, this is not to limit the scope of theinvention. The two classes are identified with the symbols (+) and (−).A training set of a number of patterns {x₁, x₂, . . . . x_(k), . . . .x_(l)} with known class labels {y₁, y₂, . . . y_(k), . . . . y_(l)},y_(k)ε{−1,+1}, is given. The training patterns are used to build adecision function (or discriminant function) D(x), that is a scalarfunction of an input pattern x. New patterns are classified according tothe sign of the decision function:D(x)>0

xεclass (+);D(x)<0

xεclass (−);D(x)=0, decision boundary;where ε means “is a member of”.Decision boundaries that are simple weighted sums of the trainingpatterns plus a bias are referred to as “linear discriminant functions”,e.g.,D(x)=w·x+b,  (1)where w is the weight vector and b is a bias value. A data set is saidto be linearly separable if a linear discriminant function can separateit without error.

Feature selection in large dimensional input spaces is performed usinggreedy algorithms and feature ranking. A fixed number of top rankedfeatures may be selected for further analysis or to design a classifier.Alternatively, a threshold can be set on the ranking criterion. Only thefeatures whose criterion exceed the threshold are retained. A preferredmethod uses the ranking to define nested subsets of features, F₁⊂F₂⊂ . .. ⊂F, and select an optimum subset of features with a model selectioncriterion by varying a single parameter: the number of features.

Errorless separation can be achieved with any number of genes greaterthan one. Preferred methods comprise use of a smaller number of genes.Classical gene selection methods select the genes that individually bestclassify the training data. These methods include correlation methodsand expression ratio methods. While the classical methods eliminategenes that are useless for discrimination (noise), they do not yieldcompact gene sets because genes are redundant. Moreover, complementarygenes that individually do not separate well are missed.

A simple feature ranking can be produced by evaluating how well anindividual feature contributes to the separation (e.g. cancer vs.normal). Various correlation coefficients have been proposed as rankingcriteria. For example, see, T. K. Golub, et al, “Molecularclassification of cancer: Class discovery and class prediction by geneexpression monitoring”, Science 286, 531-37 (1999). The coefficient usedby Golub et al. is defined as:w _(i)=(μ_(i)(+)−μ_(i)(−))/(σ_(i)(+)+σ_(i)(−))  (2)where μ_(i) and σ_(i) are the mean and standard deviation, respectively,of the gene expression values of a particular gene i for all thepatients of class (+) or class (−), i=1, . . . n. Large positive w_(i)values indicate strong correlation with class (+) whereas large negativew_(i) values indicate strong correlation with class (−). The methoddescribed by Golub, et al. for feature ranking is to select an equalnumber of genes with positive and with negative correlation coefficient.Other methods use the absolute value of w_(i) as ranking criterion, or arelated coefficient,(μ_(i)(+)−μ_(i)(−))²/(σ_(i)(+)²+σ_(i)(−)²).  (3)

What characterizes feature ranking with correlation methods is theimplicit orthogonality assumptions that are made. Each coefficient w_(i)is computed with information about a single feature (gene) and does nottake into account mutual information between features.

One use of feature ranking is in the design of a class predictor (orclassifier) based on a pre-selected subset of genes. Each feature whichis correlated (or anti-correlated) with the separation of interest is byitself such a class predictor, albeit an imperfect one. A simple methodof classification comprises a method based on weighted voting: thefeatures vote in proportion to their correlation coefficient. Such isthe method used by Golub, et al. The weighted voting scheme yields aparticular linear discriminant classifier:D(x)=w·(x−μ),  (4)where w is w_(i)=(μ_(i)(+)−μ_(i)(−))/(σ_(i)(+)+σ_(i)(−)) andμ=(μ(+)+μ(−))/2

Another classifier or class predictor is Fisher's linear discriminant.Such a classifier is similar to that of Golub et al. wherew=S ⁻¹(μ(+)+μ(−)),  (5)where S is the (n,n) within class scatter matrix defined as

$\begin{matrix}{{S = {{\sum\limits_{x \in \;{X{( + )}}}{\left( {x - {\mu( + )}} \right)\left( {x - {\mu( + )}} \right)^{T}}} + {\sum\limits_{x \in \;{X{( - )}}}x} - {{\mu( - )}\left. \quad \right)\left( {x - {\mu( - )}} \right)^{T}}}},} & (6)\end{matrix}$where μ is the mean vector over all training patters and X(+) and X(−)are the training sets of class (+) and (−), respectively. This form ofFisher's discriminant implies that S is invertible, however, this is notthe case if the number of features n is larger than the number ofexamples l since the rank of S is at most l. The classifiers ofEquations 4 and 6 are similar if the scatter matrix is approximated byits diagonal elements. This approximation is exact when the vectorsformed by the values of one feature across all training patterns areorthogonal, after subtracting the class mean. The approximation retainssome validity if the features are uncorreclated, that is, if theexpected value of the product of two different features is zero, afterremoving the class mean. Approximating S by its diagonal elements is oneway of regularizing it (making it invertible). However, features usuallyare correlated and, therefore, the diagonal approximation is not valid.

One aspect of the present invention comprises using the feature rankingcoefficients as classifier weights. Reciprocally, the weightsmultiplying the inputs of a given classifier can be used as featureranking coefficients. The inputs that are weighted by the largest valueshave the most influence in the classification decision. Therefore, ifthe classifier performs well, those inputs with largest weightscorrespond to the most informative features, or in this instance, genes.Other methods, known as multivariate classifiers, comprise algorithms totrain linear discriminant functions that provide superior featureranking compared to correlation coefficients. Multivariate classifiers,such as the Fisher's linear discriminant (a combination of multipleunivariate classifiers) and methods disclosed herein, are optimizedduring training to handle multiple variables or features simultaneously.

For classification problems, the ideal objective function is theexpected value of the error, i.e., the error rate computed on aninfinite number of examples. For training purposes, this ideal objectiveis replaced by a cost function J computed on training examples only.Such a cost function is usually a bound or an approximation of the idealobjective, selected for convenience and efficiency. For linear SVMs, thecost function is:J=(½)∥w∥ ²,  (7)which is minimized, under constraints, during training. The criteria(w_(i))² estimates the effect on the objective (cost) function ofremoving feature i.

A good feature ranking criterion is not necessarily a good criterion forranking feature subsets. Some criteria estimate the effect on theobjective function of removing one feature at a time. These criteriabecome suboptimal when several features are removed at one time, whichis necessary to obtain a small feature subset.

Recursive Feature Elimination (RFE) methods can be used to overcome thisproblem. RFE methods comprise iteratively 1) training the classifier ,2) computing the ranking criterion for all features, and 3) removing thefeature having the smallest ranking criterion. This iterative procedureis an example of backward feature elimination. For computationalreasons, it may be more efficient to remove several features at a timeat the expense of possible classification performance degradation. Insuch a case, the method produces a “feature subset ranking”, as opposedto a “feature ranking”. Feature subsets are nested, e.g., F₁⊂F₂⊂ . . .⊂F.

If features are removed one at a time, this results in a correspondingfeature ranking. However, the features that are top ranked, i.e.,eliminated last, are not necessarily the ones that are individually mostrelevant. It may be the case that the features of a subset F_(m) areoptimal in some sense only when taken in some combination. RFE has noeffect on correlation methods since the ranking criterion is computedusing information about a single feature.

In the present embodiment, the weights of a classifier are used toproduce a feature ranking with a SVM (Support Vector Machine). Thepresent invention contemplates methods of SVMs used for both linear andnon-linear decision boundaries of arbitrary complexity, however, theexample provided herein is directed to linear SVMs because of the natureof the data set under investigation. Linear SVMs are particular lineardiscriminant classifiers. (See Equation 1). If the training set islinearly separable, a linear SVM is a maximum margin classifier. Thedecision boundary (a straight line in the case of a two-dimensionseparation) is positioned to leave the largest possible margin on eitherside. One quality of SVMs is that the weights w_(i) of the decisionfunction D(x) are a function only of a small subset of the trainingexamples, i.e., “support vectors”. Support vectors are the examples thatare closest to the decision boundary and lie on the margin. Theexistence of such support vectors is at the origin of the computationalproperties of SVM and its competitive classification performance. WhileSVMs base their decision function on the support vectors that are theborderline cases, other methods such as the previously-described methodof Golub, et al., base the decision function on the average case.

A preferred method of the present invention comprises using a variant ofthe soft-margin algorithm where training comprises executing a quadraticprogram as described by Cortes and Vapnik in “Support vector networks”,1995, Machine Learning, 20:3, 273-297, which is incorporated herein byreference in its entirety. The following is provided as an example,however, different programs are contemplated by the present inventionand can be determined by those skilled in the art for the particulardata sets involved.

Inputs comprise training examples (vectors) {x₁, x₁, . . . . x_(k) . . .x_(l)} and class labels {y₁, y₂ . . . . y_(k) . . . y_(l)}. To identifythe optimal hyperplane, the following quadratic program is executed:

$\begin{matrix}\left\{ \begin{matrix}{{Minimize}\mspace{14mu}{over}\mspace{14mu}\alpha_{k}\text{:}} \\{J = {{\left( {1/2} \right){\sum\limits_{hk}{y_{h}y_{k}\alpha_{h}{\alpha_{k}\left( {{x_{h} \cdot x_{k}} + {\lambda\delta}_{hk}} \right)}}}} - {\sum\limits_{k}\alpha_{k}}}} \\{{subject}\mspace{14mu}{to}\text{:}} \\{{0 \leq \alpha_{k} \leq {C\mspace{14mu}{and}\mspace{14mu}{\sum\limits_{k}{\alpha_{k}y_{k}}}}} = 0}\end{matrix}\quad \right. & (8)\end{matrix}$with the resulting outputs being the parameters α_(k)., where thesummations run over all training patterns x_(k) that are n dimensionalfeature vectors, x_(h)·x_(k) denotes the scalar product, y_(k) encodesthe class label as a binary value=1 or −1, δ_(hk) is the Kroneckersymbol (δ_(hk)=1 if h=k and 0 otherwise), and λ and C are positiveconstants (soft margin parameters). The soft margin parameters ensureconvergence even when the problem is non-linearly separable or poorlyconditioned. In such cases, some support vectors may not lie on themargin. Methods include relying on λ or C, but preferred methods, andthose used in the Examples below, use a small value of λ (on the orderof 10⁻¹⁴) to ensure numerical stability. For the Examples providedherein, the solution is rather insensitive to the value of C because thetraining data sets are linearly separable down to only a few features. Avalue of C=100 is adequate, however, other methods may use other valuesof C.

The resulting decision function of an input vector x is:

$\begin{matrix}{{{D(x)} = {{w \cdot x} + b}}{with}w = {{\sum\limits_{k}{\alpha_{k}y_{k}x_{k}\mspace{20mu}{and}\mspace{20mu} b}} = \left\langle {y_{k} - {w \cdot x_{k}}} \right\rangle}} & (9)\end{matrix}$The weight vector w is a linear combination of training patterns. Mostweights α_(k) are zero. The training patterns with non-zero weights aresupport vectors. Those having a weight that satisfies the strictinequality 0<α_(k)<C are marginal support vectors. The bias value b isan average over marginal support vectors.

The following sequence illustrates application of recursive featureelimination (RFE) to a SVM using the weight magnitude as the rankingcriterion. The inputs are training examples (vectors) : X₀=[x₁, x₂, . .. . x_(k) . . . x_(l)]^(T) and class labels Y=[y₁, y₂ . . . . y_(k) . .. y_(l)]^(T).

Initalize:

Subset of surviving features

-   -   s=[1,2, . . . n]

Features ranked list

-   -   r=[ ]

Repeat until s=[ ]

Restrict training examples to good feature indices

-   -   X=X₀(:,s)

Train the classifier

-   -   α=SVM train(X,y)

Compute the weight vector of dimension length(s):

$w = {\sum\limits_{k}{\alpha_{k}y_{k}x_{k}}}$

-   -   Compute the ranking criteria    -   c_(i)=(w_(i))², for all i

Find the feature with smallest ranking criterion

-   -   f=argmin(c)

Update feature ranked list

-   -   r=[s(f),r]

Eliminate the feature with smallest ranking criterion

-   -   s=s(1:f−1, f=1:length (s))

The output comprises feature ranked list r.

The above steps can be modified to increase computing speed bygeneralizing the algorithm to remove more than one feature per step.

In general, RFE is computationally expensive when compared againstcorrelation methods, where several thousands of input data points can beranked in about one second using a Pentium® processor, and weights ofthe classifier trained only once with all features, such as SVMs orpseudo-inverse/mean squared error (MSE). A SVM implemented usingnon-optimized MatLab® code on a Pentium® processor can provide asolution in a few seconds. To increase computational speed, RFE ispreferrably implemented by training multiple classifiers on subsets offeatures of decreasing size. Training time scales linearly with thenumber of classifiers to be trained. The trade-off is computational timeversus accuracy. Use of RFE provides better feature selection than canbe obtained by using the weights of a single classifier. Better resultsare also obtained by eliminating one feature at a time as opposed toeliminating chunks of features. However, significant differences areseen only for a smaller subset of features such as fewer than 100.Without trading accuracy for speed, RFE can be used by removing chunksof features in the first few iterations and then, in later iterations,removing one feature at a time once the feature set reaches a fewhundreds. RFE can be used when the number of features, e.g., genes, isincreased to millions. In one example, at the first iteration, thenumber of genes were reached that was the closest power of two. Atsubsequent iterations, half of the remaining genes were eliminated, suchthat each iteration was reduced by a power of two. Nested subsets ofgenes were obtained that had increasing information density.

RFE consistently outperforms the naive ranking, particularly for smallfeature subsets. (The naïve ranking comprises ranking the features with(w_(i))², which is computationally equivalent to the first iteration ofRFE.) The naïve ranking organizes features according to their individualrelevance, while RFE ranking is a feature subset ranking. The nestedfeature subsets contain complementary features that individually are notnecessarily the most relevant. An important aspect of SVM featureselection is that clean data is most preferred because outliers play anessential role. The selection of useful patterns, support vectors, andselection of useful features are connected.

Pre-processing can have a strong impact on SVM-RFE. In particular,feature scales must be comparable. One pre-processing method is tosubtract the mean of a feature from each feature, then divide the resultby its standard deviation. Such pre-processing is not necessary ifscaling is taken into account in the computational cost function.Another pre-processing operation can be performed to reduce skew in thedata distribution and provide more uniform distribution. Thispre-processing step involves taking the log of the value, which isparticularly advantageous when the data consists of gene expressioncoefficients, which are often obtained by computing the ratio of twovalues. For example, in a competitive hybridization scheme, DNA from twosamples that are labeled differently are hybridized onto the array. Oneobtains at every point of the array two coefficients corresponding tothe fluorescence of the two labels and reflecting the fraction of DNA ofeither sample that hybridized to the particular gene. Typically, thefirst initial preprocessing step that is taken is to take the ratio a/bof these two values. Though this initial preprocessing step is adequate,it may not be optimal when the two values are small. Other initialpreprocessing steps include (a−b)/(a+b) and (log a−log b)/(log a+log b).

Another pre-processing step involved normalizing the data across allsamples by subtracting the mean. This preprocessing step is supported bythe fact that, using tissue samples, there are variations inexperimental conditions from microarray to microarray. Although standarddeviation seems to remain fairly constant, the other preprocessing stepselected was to divide the gene expression values by the standarddeviation to obtain centered data of standardized variance.

To normalize each gene expression across multiple tissue samples, themean expression value and standard deviation for each gene was computed.For all the tissue sample values of that gene (training and test), thatmean was then subtracted and the resultant value was divided by thestandard deviation. In some experiments, an additional preprocessingstep was added by passing the data through a squashing function [∫(x)=cantan (x/c)] to diminish the importance of the outliers.

In a variation on several of the preceding pre-processing methods, thedata can be pre-processed by a simple “whitening” to make data matrixresemble “white noise.” The samples can be pre-processed to: normalizematrix columns; normalize matrix lines; and normalize columns again.Normalization consists of subtracting the mean and dividing by thestandard deviation. A further normalization step can be taken when thesamples are split into a training set and a test set.

The mean and variance column-wise was computed for the training samplesonly. All samples (training and test samples) were then normalized bysubtracting that mean and dividing by the standard deviation.

In addition to the above-described linear example, SVM-RFE can be usedin nonlinear cases and other kernel methods. The method of eliminatingfeatures on the basis of the smallest change in cost function can beextended to nonlinear uses and to all kernel methods in general.Computations can be made tractable by assuming no change in the value ofthe α's. Thus, the classifer need not be retrained for every candidatefeature to be eliminated.

Specifically, in the case of SVMs, the cost function to be minimized(under the constraints 0≦α_(k)≦C and Σ_(k)α_(k)λ_(k)=0) is:J=(½)α^(T) Hα−α ^(T)1,  (10)where H is the matrix with elements γ_(h)γ_(k)K(x_(h)x_(k)), K is akernel function that measures the similarity between x_(h) and x_(k),and 1 is an l dimensional vector of ones.

An example of such a kernel function isK(x _(h) x _(k))=exp(−γ∥x _(h) −x _(k)∥²).  (11)

To compute the change in cost function caused by removing inputcomponent i, one leaves the α's unchanged and recomputes matrix H. Thiscorresponds to computing K(x_(h)(−i), x_(k)(−i), yielding matrix H(−i),where the notation (−i) means that component i has been removed. Theresulting ranking coefficient is:DJ(i)=(½)α^(T) Hα−(½)α^(T) H(−i)α  (12)The input corresponding to the smallest difference DJ(i) is thenremoved. The procedure is repeated to carry out Recursive FeatureElimination (RFE).

The advantages of RFE are further illustrated by the following example,which is not to be construed in any way as imposing limitations upon thescope thereof On the contrary, it is to be clearly understood thatresort may be had to various other embodiments, modifications, andequivalents thereof which, after reading the description herein, maysuggest themselves to those skilled in the art without departing fromthe spirit of the present invention and/or the scope of the appendedclaims.

EXAMPLE 1 Analysis of Gene Patterns Related to Colon Cancer

Analysis of data from diagnostic genetic testing, microarray data ofgene expression vectors, was performed with a SVM-RFE. The original datafor this example was derived from the data presented in Alon et al.,1999. Gene expression information was extracted from microarray dataresulting, after pre-processing, in a table of 62 tissues ×2000 genes.The 62 tissues include 22 normal tissues and 40 colon cancer tissues.The matrix contains the expression of the 2000 genes with highestminimal intensity across the 62 tissues. Some of the genes are non-humangenes.

The data proved to be relatively easy to separate. After preprocessing,it was possible to a find a weighted sum of a set of only a few genesthat separated without error the entire data set, thus the data set waslinearly separable. One problem in the colon cancer data set was thattumor samples and normal samples differed in cell composition. Tumorsamples were normally rich in epithelial cells wherein normal sampleswere a mixture of cell types, including a large fraction of smoothmuscle cells. While the samples could be easily separated on the basisof cell composition, this separation was not very informative fortracking cancer-related genes.

The gene selection method using RFE-SVM is compared against a referencegene selection method described in Golub et al, Science, 1999, which isreferred to as the “baseline method” Since there was no defined trainingand test set, the data was randomly split into 31 samples for trainingand 31 samples for testing.

In Golub et al., the authors use several metrics of classifier quality,including error rate, rejection rate at fixed threshold, andclassification confidence. Each value is computed both on theindependent test set and using the leave-one-out method on the trainingset. The leave-one-out method consists of removing one example from thetraining set, constructing the decision function on the basis only ofthe remaining training data and then testing on the removed example. Inthis method, one tests all examples of the training data and measuresthe fraction of errors over the total number of training examples.

The methods of this Example uses a modification of the above metrics.The present classification methods use various decision functions (D(x)whose inputs are gene expression coefficients and whose outputs are asigned number indicative of whether or not cancer was present. Theclassification decision is carried out according to the sign of D(x).The magnitude of D(x) is indicative of classification confidence.

Four metrics of classifier quality were used: (1) Error (B1+B2)=numberof errors (“bad”) at zero rejection; (2) Reject (R1+R2)=minimum numberof rejected samples to obtain zero error; Extremal margin(E/D)=difference between the smallest output of the positive classsamples and the largest output of the negative class samples (rescaledby the largest difference between outputs); and Median margin(M/D)=difference between the median output of the positive class samplesand the median output of the negative class samples (rescaled by thelargest difference between outputs). Each value is computed both on thetraining set with the leave-one-out method and on the test set.

The error rate is the fraction of examples that are misclassified(corresponding to a diagnostic error). The error rate is complemented bythe success rate. The rejection rate is the fraction of examples thatare rejected (on which no decision is made because of low confidence).The rejection rate is complemented by the acceptance rate. Extremal andmedian margins are measurements of classification confidence. Note thatthe margin computed with the leave-one-out method or on the test setdiffers from the margin computed on training examples sometimes used inmodel selection criteria.

A method for predicting the optimum subset of genes comprised defining acriterion of optimality that uses information derived from trainingexamples only. This criterion was checked by determining whether thepredicted gene subset performed best on the test set.

A criterion that is often used in similar “model selection” problems isthe leave-one-out success rate V_(suc). In the present example, it wasof little use since differentiation between many classifiers that havezero leave-one-out error is not allowed. Such differentiation isobtained by using a criterion that combines all of the quality metricscomputed by cross-validation with the leave-one-out method:Q=V _(suc) +V _(acc) +V _(ext) +V _(med)  (13)where V_(suc) is the success rate, V_(acc) the acceptance rate, V_(ext)the extremal margin, and V_(med) is the median margin.

Theoretical considerations suggested modification of this criterion topenalize large gene sets. The probability of observing large differencesbetween the leave-one-out error and the test error increases with thesize d of the gene set, according toε(d)=sqrt(−log(α)+log(G(d))).sqrt(p(1−p)/n)  (14)where (1−α) is the confidence (typically 95%, i.e., α=0.05), p is the“true” error rate (p<=0.01), and n is the size of the training set.

Following the guaranteed risk principle, a quantity proportional to ε(d)was subtracted from criterion Q to obtain a new criterion:C=Q−2ε(d)  (15)

The coefficient of proportionality was computed heuristically, assumingthat V_(suc), V_(acc), V_(ext) and V_(med) are independent randomvariables with the same error bar ε(d) and that this error bar iscommensurate to a standard deviation. In this case, variances would beadditive, therefore, the error bar should be multiplied by sqrt(4).

A SVM-RFE was run on the raw data to assess the validity of the method.The colon cancer data samples were split randomly into 31 examples fortraining and 31 examples for testing. The RFE method was run toprogressively downsize the number of genes, each time dividing thenumber by 2. The pre-processing of the data for each gene expressionvalue consisted of subtracting the mean from the value, then dividingthe result by the standard deviation.

The leave-one-out method with the classifier quality criterion was usedto estimate the optimum number of genes. The leave-one-out methodcomprises taking out one example of the training set. Training is thenperformed on the remaining examples, with the left out example beingused to test the trained classifier. This procedure is iterated over allthe examples. Each criteria is computed as an average over all examples.The overall classifier quality criterion is calculated according toEquation 13. The classifier is a linear classifier with hard margin.

Results of the SVM-RFE as taught herein show that at the optimumpredicted by the method using training data only, the leave-one-outerror is zero and the test performance is actually optimum.

The optimum test performance had an 81% success rate withoutpre-processing to remove skew and to normalize the data. This result wasconsistent with the results reported in the original paper by Alon etal. Moreover, the errors, except for one, were identified by Alon et al.as outliers. The plot of the performance curves as a function of genenumber is shown in FIG. 6. The predictor of optimum test success rate(diamond curve), which is obtained by smoothing after substracting εfrom the leave-one-out quality criterion, coincides with the actual testsuccess rate (circle curve) in finding the optimum number of genes to be4.

When used in conjunction with pre-processing according to thedescription above to remove skew and normalize across samples, a SVM-RFEprovided further improvement. FIG. 7 shows the results of RFE afterpreprocessing, where the predicted optimum test success rate is achievedwith 16 genes. The reduced capacity SVM used in FIG. 6 is replaced byplain SVM. Although a log₂ scale is still used for gene number, RFE wasrun by eliminating one gene at a time. The best test performance is 90%classification accuracy (8 genes). The optimum number of genes predictedfrom the classifier quality based on training data information only is16. This corresponds to 87% classification accuracy on the test set.

Because of data redundancy, it was possible to find many subsets ofgenes that provide a reasonable separation. To analyze the results, therelatedness of the genes should be understand. While not wishing to bebound by any particular theory, it was the initial theory that theproblem of gene selection was to find an optimum number of genes,preferably small, that separates normal tissues from cancer tissues withmaximum accuracy.

SVM-RFE used a subset of genes that were complementary and thus carriedlittle redundant information. No other information on the structure andnature of the data was provided. Because data were very redundant, agene that had not been selected may nevertheless be informative for theseparation.

Correlation methods such as Golub's method provide a ranked list ofgenes. The rank order characterizes how correlated the gene is with theseparation. Generally, a gene highly ranked taken alone provides abetter separation than a lower ranked gene. It is therefore possible toset a threshold (e.g. keep only the top ranked genes) that separates“highly informative genes” from “less informative genes”.

The methods of the present invention such as SVM-RFE provide subsets ofgenes that are both smaller and more discriminant. The gene selectionmethod using SVM-RFE also provides a ranked list of genes. With thislist, nested subsets of genes of increasing sizes can be defined.However, the fact that one gene has a higher rank than another gene doesnot mean that this one factor alone characterizes the better separation.In fact, genes that are eliminated in an early iteration could well bevery informative but redundant with others that were kept. The 32 bestgenes as a whole provide a good separation but individually may not bevery correlated with the target separation. Gene ranking allows for abuilding nested subsets of genes that provide good separations, howeverit provides no information as to how good an individual gene may be.Genes of any rank may be correlated with the 32 best genes. Thecorrelated genes may be ruled out at some point because of theirredundancy with some of the remaining genes, not because they did notcarry information relative to the target separation.

The gene ranking alone is insufficient to characterize which genes areinformative and which ones are not, and also to determine which genesare complementary and which ones are redundant. Therefore, additionalpre-processing in the form of clustering was performed.

To overcome the problems of gene ranking alone, the data waspreprocessed with an unsupervised clustering method. Genes were groupedaccording to resemblance (according to a given metric). Cluster centerswere then used instead of genes themselves and processed by SVM-RFE toproduce nested subsets of cluster centers. An optimum subset size can bechosen with the same cross-validation method used before.

Using the data, the QT_(clust) clustering algorithm was used to produce100 dense clusters. (The “quality clustering algorithm” (QT_(clust)) iswell known to those of skill in the field of analysis of gene expressionprofiles.) The similarity measure used was Pearson's correlationcoefficient (as commonly used for gene clustering). FIG. 8 provides theperformance curves of the results of RFE when trained on 100 denseQT_(clust) clusters. As indicated, the predicted optimum number of genecluster centers is 8. The results of this analysis are comparable tothose of FIG. 7.

With unsupervised clustering, a set of informative genes is defined, butthere is no guarantee that the genes not retained do not carryinformation. When RFE was used on all QT_(clust) clusters plus theremaining non-clustered genes (singleton clusters), the performancecurves were quite similar, though the top set of gene clusters selectedwas completely different and included mostly singletons.

The cluster centers can be substituted by any of their members. Thisfactor may be important in the design of some medical diagnosis tests.For example, the administration of some proteins may be easier than thatof others. Having a choice of alternative genes introduces flexibilityin the treatment and administration choices.

Hierarchical clustering instead of QT_(clust) clustering was used toproduce lots of small clusters containing 2 elements on average. Becauseof the smaller cluster cardinality, there were fewer gene alternativesfrom which to choose. In this instance, hierarchical clustering did notyield as good a result as using QT_(clust) clustering. The presentinvention contemplates use of any of the known methods for clustering,including but not limited to hierarchical clustering, QT_(clust)clustering and SVM clustering. The choice of which clustering method toemploy in the invention is affected by the initial data and the outcomedesired, and can be determined by those skilled in the art.

Another method used with the present invention was to use clustering asa post-processing step of SVM-RFE. Each gene selected by running regularSVM-RFE on the original set of gene expression coefficients was used asa cluster center. For example, the results described with reference toFIG. 7 were used. For each of the top eight genes, the correlationcoefficient was computed with all remaining genes. The parameters werethat the genes clustered to gene i were those that met the following twoconditions: higher correlation coefficient with gene i than with othergenes in the selected subset of eight genes, and correlation coefficientexceeding a threshold θ.

Compared to the unsupervised clustering method and results, thesupervised clustering method, in this instance, does not provide bettercontrol over the number of examples per cluster. Therefore, this methodis not as good as unsupervised clustering if the goal is the ability toselect from a variety of genes in each cluster. However, supervisedclustering may show specific clusters that have relevance for thespecific knowledge being determined. In this particular embodiment, inparticular, a very large cluster of genes was found that containedseveral muscle genes that may be related to tissue composition and maynot be relevant to the cancer vs. normal separation. Thus, those genesare good candidates for elimination from consideration as having littlebearing on the diagnosis or prognosis for colon cancer.

An additional pre-processing operation involved the use of expertknowledge to eliminating data that are known to complicate analysis dueto the difficulty in differentiating the data from other data that isknown to be useful. In the present Example, tissue composition-relatedgenes were automatically eliminated in the pre-processing step bysearching for the phrase “smooth muscle”. Other means for searching thedata for indicators of the smooth muscle genes may be used.

The number of genes selected by Recursive Feature Elimination (RFE) wasvaried and was tested with different methods. Training was done on theentire data set of 62 samples. The curves represent the leave-one-outsuccess rate. For comparison, the results obtained using severaldifferent methods for gene selection from colon cancer data are providedin FIG. 9. SVM-RFE is compared to Linear Discriminant Analysis(LDA)-RFE; Mean Squared Error (Pseudo-inverse)-(MSE)-RFE and thebaseline method (Golub, 1999). As indicated, SVM-RFE provides the bestresults down to 4 genes. An examination of the genes selected revealsthat SVM eliminates genes that are tissue composition-related and keepsonly genes that are relevant to the cancer vs. normal separation.Conversely, other methods retain smooth muscle genes in their top rankedgenes which aids in separating most samples, but is not relevant to thecancer vs. normal discrimination.

All methods that do not make independent assumptions outperform Golub'smethod and reach 100% leave-one-out accuracy for at least one value ofthe number of genes. LDA may be at a slight disadvantage on these plotsbecause, for computational reasons, RFE was used by eliminating chunksof genes that decrease in size by powers of two. Other methods use RFEby eliminating one gene at a time.

Down to four genes, SVM-RFE provided better performance than the othermethods. All methods predicted an optimum number of genes smaller orequal to 64 using the criterion of the Equation 15. The genes ranking 1through 64 for all the methods studied were compared. The first genethat was related to tissue composition and mentions “smooth muscle” inits description ranks 5 for Golub's method, 4 for LDA, 1 for MSE andonly 41 for SVM. Therefore, this was a strong indication that SVMs makea better use of the data compared with other methods since they are theonly methods that effectively factors out tissue composition-relatedgenes while providing highly accurate separations with a small subset ofgenes.

FIG. 10 is a plot of an optimum number of genes for evaluation of coloncancer data using RFE-SVM. The number of genes selected by recursivegene elimination with SVMs was varied and a number of quality metricswere evaluated include error rate on the test set, scaled qualitycriterion (Q/4), scaled criterion of optimality (C/4), locally smoothedC/4 and scaled theoretical error bar (ε/2). The curves are related byC=Q−2ε.

The model selection criterion was used in a variety of other experimentsusing SVMs and other algorithms. The optimum number of genes was alwayspredicted accurately, within a factor of two of the number of genes.

Feature Selection by Minimizing l₀-norm.

A second method of feature selection according to the present inventioncomprises minimizing the l₀-norm of parameter vectors. Such a procedureis central to many tasks in machine learning, including featureselection, vector quantization and compression methods. This methodconstructs a classifier which separates data using the smallest possiblenumber of features. Specifically, the l₀-norm of w is minimized bysolving the optimization problem

$\begin{matrix}{{{\min\limits_{w}{{w}_{0}\mspace{20mu}{subject}\mspace{14mu}{to}\text{:}\mspace{14mu}{y_{i}\left( {\left\langle {w,x_{i}} \right\rangle + b} \right)}}} \geq 1},} & (16)\end{matrix}$where ∥w∥₀=card {w_(i|)w_(i)≠0}. In other words, the goal is to find thefewest non-zero elements in the vector of coefficients w. However,because this problem is combinatorially hard, the followingapproximation is used:

$\begin{matrix}{{\min\limits_{w}{\sum\limits_{j = 1}^{n}{{\ln\left( {ɛ + {w_{j}}} \right)}\mspace{14mu}{subject}\mspace{14mu}{to}\text{:}}}}{{y_{i}\left( {\left\langle {w,x_{i}} \right\rangle + b} \right)} \geq 1}} & (17)\end{matrix}$where ε<<1 has been introduced in order to avoid ill-posed problemswhere one of the w_(j) is zero. Because there are many local minima,Equation 17 can be solved by using constrained gradient.

Let w₁(ε), also written w₁ when the context is clear, be the minimizerof Equation 17, and w₀ the minimizer of Equation 16, which provides

$\begin{matrix}{{\sum\limits_{j = 1}^{n}{\ln\left( {ɛ + {\left( w_{l} \right)_{j}}} \right)}} \leq {\sum\limits_{j = 1}^{n}{\ln\left( {ɛ + {\left( w_{o} \right)_{j}}} \right)}}} & (18) \\{\leq {{\sum\limits_{{(w_{o})}_{j} = 0}{\ln(ɛ)}} + {\sum\limits_{{(w_{o})}_{j} \neq 0}{\ln\;\left( {ɛ + {\left( w_{o} \right)_{j}}} \right)}}}} & (19) \\{\leq {{\left( {n - {w_{0}}_{0}} \right){\ln(ɛ)}} + {\sum{\ln\;{\left( {ɛ + {\left( w_{o} \right)_{j}}} \right).}}}}} & (20)\end{matrix}$The second term of the right hand side of Equation 20 is negligiblecompared to the ln(ε) term when ε is very small. Thus,

$\begin{matrix}{{{\left( {n - {w_{i}}_{0}} \right){\ln(ɛ)}} + {\sum\limits_{{(w_{i})}_{j} \neq 0}{\ln\left( {ɛ + {\left( w_{i} \right)_{j}}} \right)}}} \leq {\left( {n - {w_{0}}_{0}} \right){\ln(ɛ)}}} & (21) \\{{w_{i}}_{0} \leq {{w_{0}}_{0} - {\sum\limits_{{(w_{i})}_{j} \neq 0}{\frac{\ln\left( {ɛ + {\left( w_{i} \right)_{j}}} \right)}{\ln\left( {1/ɛ} \right)}.}}}} & (22)\end{matrix}$

Depending on the value of (w₁(ε))_(j), the sum on the right-hand sidecan be large or small when ε→0. This will depend mainly on the problemat hand. Note, however, that if ε is very small, for example, if εequals the machine precision, then as soon as (w₁)_(j) is Ω(1), the zeronorm of w₁ is the same as the zero norm of w₀.

The foregoing supports that fact that for the objective problem ofEquation 17 it is better to set w_(j) to zero whenever possible. This isdue to the form of the logarithm function that decreases quickly to zerocompared to its increase for large values of w_(j). Thus, it is betterto increase one w_(j) while setting another to zero rather than making acompromise between both. From this point forward, it will be assumedthat ε is equal to the machine precision. To solve the problem ofEquation 17, an iterative method is used which performs a gradient stepat each iteration. The method known as Franke and Wolfe's method isproved to converge to a local minimum. For the problem of interest, ittakes the following form, which can be defined as an “Approximation ofthe l₀-norm Minimization”, or “AL0M”:

1. Start with w=(1, . . . , 1)

2. Assume w_(k) is given. Solve

$\begin{matrix}{{\min\;{\sum\limits_{j = 1}^{n}{{w_{j}}\mspace{20mu}{subject}\mspace{14mu}{to}\text{:}}}}{{y_{i}\left( {\left\langle {w,\left( {x_{i}*w_{k}} \right)} \right\rangle + b} \right)} \geq 1.}} & (23)\end{matrix}$

3. Let ŵ be the solution of the previous problem. Set W_(k+1)=w_(k)*ŵ,where * denotes element-wise multiplication.

4. Repeat steps 2 and 3 until convergence.

AL0M solves a succession of linear optimization problems with non-sparseconstraints. Sometimes, it may be more advantageous to have a quadraticprogramming formulation of these problems since the dual may have simpleconstraints and may then become easy to solve. As a generalization ofthe previous method, the present embodiment uses a procedure referred toas “l₂-AL0M” to minimize the l₀ norm as follows:

1. Start with w=(1, . . . , 1)

2. Assume w_(k) is given. Solve

$\begin{matrix}{{\min\limits_{w}{{w}_{2}^{2}\mspace{20mu}{subject}\mspace{14mu}{to}\text{:}\mspace{14mu}{y_{i}\left( {\left\langle {w,\left( {x_{i}*w_{k}} \right)} \right\rangle + b} \right)}}} \geq 1.} & (24)\end{matrix}$

3. Let ŵ be the solution of the previous problem. Set w_(k+1)=w_(k)*ŵ.

4. Repeat steps 2 and 3 until convergence.

This method is developed for a linearly-separable learning set.

When many classes are involved, one could use a classical trick thatconsists of decomposing the multiclass problem into many two-classproblems. Generally, a “one-against-all” approach is used. One vectorw_(c) and one bias b_(c) are defined for each class c, and the output iscomputed as

$\begin{matrix}{{f(x)} = {{\arg\mspace{14mu}{\max\limits_{c}\left\langle {w_{c},x} \right\rangle}} + {b_{c}.}}} & (25)\end{matrix}$Then, the vector w_(c) is learned by discriminating the class c from allother classes. This gives many two-class problems. In this framework,the minimization of the l₀-norm is done for each vector w_(c)independently of the others. However, the true l₀-norm is the following:

$\begin{matrix}{\sum\limits_{c = 1}^{K}{w_{c}}_{0}} & (26)\end{matrix}$where K is the number of classes. Thus, applying this kind ofdecomposition scheme adds a suboptimal process to the overall method. Toperform a l₀-norm minimization for the multi-class problems, theabove-described l₂ approximation method is used with differentconstraints and a different system:

1. Start with w_(c)=(1, . . . , 1), for c=1, . . . , K

2. Assume W_(k)=(w₁ ^(k), . . . w_(c) ^(k), . . . , w_(K) ^(k))is given.Solve

$\begin{matrix}{\min\limits_{w}{\sum\limits_{c = 1}^{K}{w_{c}}_{2}^{2}}} & (27)\end{matrix}$

-   -   subject to: <w_(c(i)), (x_(i)*w_(c)i)) ^(k))>−<w_(c),        (x_(i)*w_(c) ^(k))>+b_(c(i))−b_(c)≧1 for c=1, . . . , K.

3. Let Ŵ be the solution to the previous problem. Set W_(k+1)=W_(k)*Ŵ.

4. Repeat steps 2 and 3 until convergence.

As for the two-class case, this procedure is related to the followingminimization problem:

$\begin{matrix}{\min\limits_{w}{\sum\limits_{c = 1}^{K}{\sum\limits_{j = 1}^{n}{\ln\left( {ɛ + {{\left( w_{c} \right)_{j}}\mspace{20mu}{subject}{\;\mspace{14mu}}{to}\text{:}}} \right.}}}} & \; \\{{\left\langle {w_{c{(i)}},\left( {x_{i}*w_{c{(i)}}^{k}} \right)} \right\rangle - \left\langle {w_{c},\left( {x_{i}*w_{c}^{k}} \right)} \right\rangle + b_{c{(i)}} - b_{c}} \geq 1} & (28)\end{matrix}$for k=1, . . . , C.

In order to generalize this algorithm to the non-linear case, thefollowing dual optimization problem must be considered:

$\begin{matrix}{{{\max\limits_{\alpha_{i}}{{- \frac{1}{2}}{\sum\limits_{i,{j = 1}}^{i}{\alpha_{i}\alpha_{j}y_{i}y_{j}\left\langle {x_{i},x} \right\rangle}}}} + {\sum\limits_{i = 1}^{i}\alpha_{i}}}{{subject}\mspace{14mu}{to}\mspace{14mu}{\sum\limits_{i = 1}^{i}{\alpha_{i}y_{i}}}}} & (29)\end{matrix}$where C≧α_(i)≧0 and where α_(i) are the dual variables related to theconstraints y_(i)(<w,x,>+b)≧1−ξ_(i). The solution of this problem canthen be used to compute the value of w and b. In particular,

$\begin{matrix}{\left\langle {w,x} \right\rangle = {\sum\limits_{i = 1}^{l}{\alpha_{i}y_{i}{\left\langle {x_{i},x} \right\rangle.}}}} & (30)\end{matrix}$This means that the decision function of Equation 1, which may also berepresented as D(x)=sign (<w,x>+b), can only be computed using dotproducts. As a result, any kind of function k(. , .) can be used insteadof <. , .> if k can be understood as a dot-product. Accordingly, thefunction Φ(x)=φ₁(x), . . . , φ_(i)(x) εl₂ which maps points xi into afeature space such that k(x_(i), x₂) can be interpreted as a dot productin feature space.

To apply the procedure in feature space it is necessary to computeelement-wise multiplication in feature space. In order to avoid directlycomputing vectors in feature space, this multiplication is performedwith kernels. Multiplications in feature space are of the formφ(x)*φ(y).

First, consider feature spaces which are sums of monomials of order d.That is, kernels that describe feature spaces of the formφ_(d)(x)=<x _(i) ₁ , . . . , x _(i) _(d) :i≦i ₁ , i ₂ , . . . , i _(d)≦N>.Next, perform an element-wise multiplication in this space:φ_(d)(x)*φ_(d)(y)=<x _(i) ₁ y _(i) ₁ . . . x _(i) _(d) y _(i) _(d) :i≦i₁ , i ₂ , . . . , i _(d) ≦N>, which is equal toφ_(d)(x*y)=<x _(i) ₁ y _(i) ₁ . . . x _(i) _(d) y _(i) _(d) :i≦i ₁ , i ₂, . . . , i _(d) ≦N>.Therefore, instead of calculating φ(x)*φ(y), the equivalent expressioncan be calculated, avoiding computation of vector in feature space. Thiscan be extended to feature spaces with monomials of degree d or less(polynomials) by noticing thatφ_(1:d)(x)=<φ_(p)(x):1≦p≦d>andφ_(1:d)(x*y)=φ_(1:d)(x)*φ_(1:d)(y).  (31)

Applying this to the problem at hand, one needs to compute both w*x_(i)and <(w*x_(i)), x_(j)<. The following shows how to approximate such acalculation.

After training an SVM, minimizing the l₂-norm in dual form, one obtainsthe vector w expressed by coefficients α

$\begin{matrix}{w = {\sum\limits_{i = 1}^{l}{\alpha_{i}y_{i}\phi\;{\left( x_{i} \right).}}}} & (32)\end{matrix}$The map φ(x_(i))→φ(x_(i))*w must be computed to calculate the dotproducts (the kernel) between training data. Thus, a kernel functionbetween data points x_(i) and x_(j) is nowk _(M) ₁ (x _(i) , x _(j))=<(φ(x _(i))*w),(φ(x _(j))*w)>=<φ(x _(i))*φ(x_(j)),(w*w)>.  (33)

Let the vector s_(i) be the vector that scales the original dot producton iteration i+1 of the algorithm. Then, s₀ is the vector of ones,s₁=(w*w) and, in general, s_(i)=s_(i−1)*(w*w) when w is the vector ofcoefficients from training on step i. Thus, the kernel function oniteration i+1 isk _(s) ₁ (x _(i) , x _(j))=<(φ(x _(i))*φ(x _(j)), s _(i)>.Considering the kernel for iteration 2, s₁,

$\begin{matrix}{{k_{s_{i}}\left( {x_{i},x_{j}} \right)} = \left\langle \left( {{{\phi\left( x_{i} \right)}*{\phi\left( x_{j} \right)}},s_{i}} \right\rangle \right.} \\{= \left\langle {\left( {{\phi\left( x_{i} \right)}*s_{i}} \right),{\phi\left( x_{j} \right)}} \right\rangle}\end{matrix}$Now, s₁=(w*w). For polynomial type kernels utilizing Equations 31 and32,

$\begin{matrix}{s_{i} = {\sum\limits_{n,{m = 1}}^{l}{\alpha_{n}\alpha_{m}y_{n}y_{m}\;{{\phi\left( {x_{n}*x_{m}} \right)}.}}}} & (34)\end{matrix}$This produces the kernelk _(s) ₁ (x _(i) , x _(j))=Σα_(i)α_(j) y _(i) y _(j) k(x _(n) *x _(m) *x_(i) , x _(j)),and in general n step n>0,

$\begin{matrix}{{s_{n} = {\sum\limits_{i_{1},\ldots,i_{n}}^{l}{\alpha_{i_{1}}y_{i_{1}}\;\ldots\;\alpha_{i_{n}}y_{i_{n}}{\phi\left( {x_{i_{1}}*\ldots\;*\ldots\mspace{11mu} x_{i_{n}}} \right)}}}},{{k_{s_{n}}\left( {x_{i},x_{j}} \right)} = {\sum\limits_{i_{1},\ldots\;,i_{n}}^{l}{\alpha_{i_{1}}y_{i_{1}}\;\ldots\;\alpha_{i_{n}}y_{i_{n}}{{k\left( {{x_{i_{1}}*\ldots\;*{\ldots x}_{i_{n}}*x_{i}},x_{j}} \right)}.}}}}} & (35)\end{matrix}$

As this can become costly to compute after iteration 2, the vector s₁can be computed at each step as a linear combination of training points,i.e.,s _(n)=Σβ_(i) ^(n)φ(x _(i))k _(s) _(n) (x _(i) , x _(j))=Σβ_(k) ^(n) k(x _(k) *x _(i) *x _(j))This can be achieved by, at each step, approximating the expansion

$\left( {w*w} \right) = {\sum\limits_{n,{m = 1}}^{l}{\alpha_{n}\alpha_{m}y_{n}y_{m}{\phi\left( {x_{n}*x_{m}} \right)}}}$with

$w_{approx}^{2} = {\sum\limits_{i = 1}^{l}{\beta_{i}{{\phi\left( x_{i} \right)}.}}}$The coefficients β_(i) are found using a convex optimization problem byminimizing in the l₂-norm the different between the true vector and theapproximation. That is, the following is minimized:

$\begin{matrix}\begin{matrix}{{{w_{approx}^{2} - \left( {w*w} \right)}}_{2}^{2} = {{{\sum\limits_{i = 1}^{l}{\beta_{i}{\phi\left( x_{i} \right)}}} - {\sum\limits_{n,{m = 1}}^{l}{\alpha_{n}\alpha_{m}y_{n}y_{m}{\phi\left( {x_{n}*x_{m}} \right)}}}}}_{2}^{2}} \\{= {{\sum\limits_{i,{j = 1}}^{l}{\beta_{i}\beta_{j}{k\left( {x_{i},x_{j}} \right)}}} -}} \\{{2{\sum\limits_{i,n,{m = 1}}^{l}{\beta_{i}\alpha_{n}\alpha_{m}y_{n}y_{n}{k\left( {{x_{n}*x_{m}},x_{i}} \right)}}}} + {{const}.}}\end{matrix} & (36)\end{matrix}$

Similarly, an approximation for s_(i)=s_(i−1)*(w*w) can also be found,again approximating the expansion found after the * operation. Finally,after the final iteration (p) test points can be classified usingf(x)=<w, (x*s _(p−1))>+b.  (37)

It is then possible to perform an approximation of the minimization ofthe l₀-norm in the feature space for polynomial kernels. Contrary to thelinear case, it is not possible to explicitly look at the coordinates ofthe resulting w. It is defined in feature space, and only dot productcan be performed easily. Thus, once the algorithm is finished, one canuse the resulting classifier for prediction, but less easily forinterpretation. In the linear case, on the other hand, one may also beinterested in interpretation of the sparse solution, i.e., a means forfeature selection.

To test the AL0M approach, it is compared to a standard SVM with nofeature selection, a SVM using RFE, and Pearson correlationcoefficients.

In a linear problem, six dimensions out of 202 were relevant. Theprobability of y=1 or −1 was equal. The first three features {x₁, x₂,x₃} were drawn as x_(i)=yN(i, 1) and the second three features {x₄, x₅,x₆} were drawn as x_(i)=N(0,1) with a probability of 0.7. Otherwise, thefirst three features were drawn as x_(i)=yN(0,1) and the second three asx_(i)=yN(i−3,1). The remaining features are noise x_(i)=N(0,20), i=7, .. . , 202. The inputs are then scaled to have a mean of zero and astandard of one.

In this problem, the first six features have redundancy and the rest ofthe features are irrelevant. Linear decision rules were used and featureselection was performed selecting the two best features using each ofthe above-mentioned methods along with AL0M SVMs, using both l₁ and l₂multiplicative updates. Training was performed on 10, 20 and 30 randomlydrawn training points, testing on a further 500 points, and averagingtest error over 100 trials. The results are provided in Table 1. Foreach technique, the test error and standard deviation are given.

TABLE 1 Method 10 pts. 20 pts. 30 pts. SVM 0.344 ± 0.07 0.217 ± 0.040.162 ± 0.03 CORR SVM 0.274 ± 0.15 0.157 ± 0.07 0.137 ± 0.03 RFE SVM0.268 ± 0.15 0.114 ± 0.10 0.075 ± 0.06 l₂-AL0M SVM 0.270 ± 0.15 0.097 ±0.10 0.063 ± 0.05 l₁-AL0M SVM 0.267 ± 0.16 0.078 ± 0.06 0.056 ± 0.04

AL0M SVMs slightly outperform RFE SVMs, whereas conventional SVMoverfit. The l₀ approximation compared to RFE also has a lowercomputational cost. In the RFE approach, n iterations are performed,removing one feature per iteration, where n is the number of inputdimensions. As described with regard to the previous embodiment, RFE canbe sped up by removing more than one feature at a time.

Table 2 provides the p-values for the null hypothesis that the algorithmin a given row does not outperform the algorithm in a given column usingthe Wilcoxon signed rank test. The Wilcoxon test evaluates whether thegeneralization error of one algorithm is smaller than another. Theresults are given for 10, 20 and 30 training points.

TABLE 2 l₂-AL0M l₁-AL0M Method SVM CORR RFE SVM SVM SVM (10 pts) 1 1 1 11 (20 pts) 1 1 1 1 1 (30 pts.) 1 1 1 1 1 CORR (10 pts.) 0 1 0.33 0.370.34 (20 pts) 0 1 1 1 1 (30 pts.) 0 1 1 1 1 RFE (10 pts.) 0 0.67 1 0.560.39 (20 pts) 0 0 1 0.98 1 (30 pts.) 0 0 1 0.96 1 l₂-AL0M (10 pts.) 00.63 0.44 1 0.35 (20 pts) 0 0 0.02 1 0.97 (30 pts.) 0 0 0.04 1 0.97l₁-AL0M (10 pts.) 0 0.66 0.61 0.65 1 (20 pts) 0 0 0 0.03 1 (30 pts.) 0 00 0.03 1For 20 and 30 training points, the l₁-AL0M SVM method outperforms allother methods. The next best results are obtained with the l₂-AL0M SVM.This ranking is consistent with the theoretical analysis of thealgorithm—the l₂-norm approximation should not be as good at choosing asmall subset of features relative to the l₁ approximation which iscloser to the true l₀-norm minimization.

EXAMPLE 2 Colon Cancer Data

Sixty-two tissue samples probed by oligonucleotide arrays contain 22normal and 40 colon cancer tissues that must be discriminated based uponthe expression of 2000 genes. Splitting the data into a training set of50 and a test set of 12 in 500 separate trials generated a test error of16.6% for standard linear SVMs. Then, the SVMs were trained withfeatures chosen by three different input selection methods: correlationcoefficients (“CORR”), RFE, and the l₂-norm approach according to thepresent embodiment. Subsets of 2000, 1000, 500, 250, 100, 50 and 20genes were selected. The results are provided in the following Table 3.

TABLE 3 No of Features CORR SVM RFE SVM l₂-AL0M 2000 16.4% ± 8  16.4% ±8 16.4% ± 8 1000 17.7% ± 9  16.4% ± 9 16.3% ± 9 500 19.1% ± 9  15.8% ± 916.0% ± 9 250 18.2% ± 10 16.0% ± 9 16.5% ± 9 100 20.7% ± 10 15.8% ± 915.2% ± 9 50 21.6% ± 10 16.0% ± 9  15.1% ± 10 20 22.3% ± 11  18.1% ± 10 16.8% ± 10AL0M SVMs slightly outperform RFE SVMs, whereas correlation coefficients(CORR SVM) are significantly worse. Table 4 provides the p-values usingthe Wilcoxon sign rank test to demonstrate the significance of thedifference between algorithms, again in showing that RFE SVM and l₂-AL0Moutperform correlation coefficients. l₂-AL0M outperforms RFE for smallfeature set sizes.

TABLE 4 Method 2000 1000 500 250 100 50 20 CORR < RFE 0.5 0 0 0 0 0 0CORR < l₂-AL0M 0.5 0 0 0 0 0 0 RFE < l₂-AL0M 0.5 0.11 0.92 0.98 0.030.002 0.001

EXAMPLE 3 Lymphoma Data

The gene expression of 96 samples is measured with microarrays to give4026 features, with 61 of the samples being in classes “DLCL”, “FL”, or“CL” (malignant) and 35 labeled otherwise (usually normal.) Using thesame approach as in the previous Example, the data was split intotraining sets of size 60 and test sets of size 36 over 500 separatetrials. A standard linear SVM obtains 7.14% error. The results using thesame feature selection methods as before are shown in Table 5.

TABLE 5 Features CORR SVM RFE SVM l₂-AL0M SVM 4026 7.13% ± 4.2 7.13% ±4.2 7.13% ± 4.2 3000 7.11% ± 4.2 7.14% ± 4.2 7.14% ± 4.2 2000 6.88% ±4.3 7.13% ± 4.2 7.06% ± 4.3 1000 7.03% ± 4.3 6.89% ± 4.2 6.86% ± 4.2 5007.40% ± 4.3 6.59% ± 4.2 6.48% ± 4.2 250 7.49% ± 4.5 6.16% ± 4.1 6.18% ±4.2 100 8.35% ± 4.6 5.96% ± 4.0 5.96% ± 4.1 50 10.14% ± 5.1  6.70% ± 4.36.62% ± 4.2 20 13.63% ± 5.9  8.08% ± 4.6 8.57% ± 4.5

RFE and the approximation to the l₂-AL0M again outperform correlationcoefficients. l₂-AL0M and RFE provided comparable results. Table 6 givesthe p-values using the Wilcoxon sign rank test to show the significanceof the difference between algorithms.

TABLE 6 Features CORR < RFE CORR < l₂-AL0M RFE < l₂-AL0M 4026 0.5 0.50.5 3000 0.578 0.578 0.5 2000 0.995 0.96 0.047 1000 0.061 0.04 0.244 5000 0 0.034 250 0 0 0.456 100 0 0 0.347 50 0 0 0.395 20 0 0 0.994

EXAMPLE 4 Yeast Dataset

A microarray dataset of 208 genes (Brown Yeast dataset) wasdiscriminated into five classes based on 79 gene expressionscorresponding to different experimental conditions. Two 8cross-validation runs were performed. The first run was done with aclassical multiclass SVM without any feature selection method. Thesecond run was done with a SVM and a pre-processing step using themulticlass l₂-AL0M procedure to select features. Table 7 shows theresults, i.e., that the l₂-AL0M multiclass SVM outperforms the classicalmulticlass SVM. As indicated, the number of features following featureselection is greatly reduced relative to the original set of features.

TABLE 7 Test Error No. of Features multiclass SVM 4.8% 79 l₂-AL0Mmulticlass SVM 1.3% 20

As an alternative to feature selection as a pre-processing step, it maybe desirable to select a subset of features after mapping the inputsinto a feature space while preserving or improving the discriminativeability of a classifier. This can be distinguished from classicalfeature selection where one is interested in selecting from the inputfeatures. The goal of kernel space feature selection is usually one ofimproving generalization performance rather than improving running timeor attempting to interpret the decision rule.

An approach to kernel space feature selection using l₂-AL0M is comparedto the solution of SVMs on some general toy problems. (Note that inputfeature selection methods such as those described above are not comparedbecause if w is too large in dimension, such methods cannot be easilyused.) Input spaces of n=5 and n=10 and a mapping into feature space ofpolynomial of degree 2, i.e., φ(x)=φ_(1:2)(x) were chosen. The followingnoiseless problems (target functionals) were chosen: (a) ƒ(x)=x₁x₂+x₃,(b) ƒ(x)=x₁; and randomly chosen polynomial functions with (c) 95%, (d)90%, (e) 80% and (f) 0% sparsity of the target. That is, d % of thecoefficients of the polynomial to be learned are zero. The problems wereattempted for training set sizes of 20, 50, 75 and 100 over 30 trials,and test error measured on a further 500 testing points. Results areplotted in FIGS. 11 a-f for n=5 and FIGS. 12 a-f for n=10. The dottedlines in the plots represent the performance of a linear SVM on aparticular problem, while the solid line represent the AL0M -SVMperformance. Multiplicative updates were attempted using the methoddescribed above for polynomial kernels (see, e.g., Equation 31). Notethat comparing to the explicit calculation of w, which is not alwayspossible if w is large, the performance was identical. However, this isnot expected to always be the case since, if the required solution doesnot exist in the span of the training vectors, then Equation 36 could bea poor approximation.

In problems (a) through (d), shown in FIGS. 11 a-d and 12a-d, thel₂-AL0M SVM clearly outperforms SVMs. This is due to the norm that SVMsuse. The l₂-norm places a preference on using as many coefficients aspossible in its decision rule. This is costly when the number offeatures one should use is small.

In problem (b) (FIGS. 11 b and 12 b), where the decision rule to belearned is linear (just one feature), the difference is the largest. Thelinear SVM outperforms the polynomial SVM, but the AL0M featureselection method in the space of polynomial coefficients of degree 2outperforms the linear SVM. This suggests that one could also try to usethis method as a crude form of model selection by first selecting alarge capacity model then reducing the capacity by removing higher orderpolynomial terms.

Problems (e) and (f) (FIGS. 11 e-f and 12 e-f) show an 80% and 0%sparse, respectively. Note that while the method outperforms SVMs in thecase of spare targets, it is not much worse when the target is notsparse, as in problem (f). In problem (e), the AL0M method is slightlybetter than SVMs when there are more than 50 training points, but worseotherwise. This may be due to error when making sparse rules when thedata is too scarce. This suggests that it may be preferable in certaincases to choose a rule that is only partially sparse, i.e., something inbetween the l₂-norm and l₀-norm. It is possible to obtain such a mixtureby considering individual iterations. After one or two iterations ofmultiplicative learning, the solution should still be close to thel₂-norm. Indeed, examining the test error after each of one, two andthree iterations with 20 or 50 training points on problem (e), it isapparent that the performance has improved compared to the l₂-norm (SVM)solution. After further iterations, performance deteriorates. Thisimplies that a method is needed to determine the optimal mixture betweenthe l₀-norm and l₂-norm in order to achieve the best performance.

In application to feature selection, the l₀-norm classifier separatesdata using the least possible number of features. This is acombinatorial optimization problem which is solved approximately byfinding a local minimum through the above-described techniques. Thesefeatures can then be used for another classifier such as a SVM. If thereare still too many features, they can be ranked according to theabsolute value of the weight vector of coefficient assigned to them bythe separating hyperplane.

Feature selection methods can also be applied to define very sparse SVM.For that purpose, it is useful to review the primal optimization problemfrom which SVMs are defined:

${\min\limits_{w,\xi_{i}}{w}_{2}^{2}} + {C{\sum\limits_{i = 1}^{l}{\xi_{i}\mspace{14mu}{subject}\mspace{14mu}{to}\text{:}}}}$y_(i)(⟨w, x_(i)⟩ + b) ≥ 1 − ξ_(i) ξ_(i) ≥ 0

It is known that solutions ŵ of such problems are expressed as acombination of the kernels k(x_(i), .):

$\begin{matrix}{\hat{w} = {\sum\limits_{i = 1}^{l}{\alpha_{i}y_{i}{k\left( {x_{i},\ldots}\; \right)}}}} & (38)\end{matrix}$and the non-zero α_(i)'s are called the support vectors in the sensethat they support the computation of the vector ŵ. One of the positiveproperties that has been underlined by many theoretical bounds is thatwhen this number is small, the generalization error should be small aswell. The proof of such results relies on compression arguments and itis generally believed that compression is good for learning. Followingthe latter assertion and applying the l₀-norm minimization to the vector(α₁, . . . , α_(l)), the goal is to obtain a linear model with as fewnon-zero α_(i) as possible. The problem to be addressed is:

$\begin{matrix}{{\min\limits_{\alpha_{i},\xi_{i}}{{\alpha }_{0}\mspace{25mu}{subject}\mspace{20mu}{to}\text{:}}}{{y_{i}\left( {\sum\limits_{j = 1}^{l}{\alpha_{j}{k\left( {x_{j},x_{i}} \right)}}} \right)} \geq 1}} & (39)\end{matrix}$

To solve this problem, the previously-described approximation is used.The slack variables ξ_(i) are no longer used since the approach has beendefined for hard margin (C=+∞) only. The following routine is used toaddress the non-separable datasets,

1. Start with α=(1, . . . , 1)

2. Assume α_(k) is given, and solve

$\begin{matrix}{\min\mspace{11mu}{\sum\limits_{j = 1}^{n}{\alpha_{j}}}} & (40)\end{matrix}$

-   -   subject to: y_(i)(<α,(k(x_(i),.)*α_(k))>+b)≦1

3. Let {acute over (α)} the solution of the previous problem. Setα_(k+1)=α_(k)*{acute over (α)}.

4. Go back to 2.

The present method tends to behave like a SVM but with a very sparseexpansion (see Equation 38). To illustrate, the following simple exampleis used. The learning set consists of 100 points in [0,1]² with ±1targets as they are drawn in FIG. 13.

For this learning set, multiple runs of a classic SVM and a sparse SVMwere run. A radial basis function (RBF) kernel with a value of σ=0.1 wasused. In FIG. 13 a, the result of the sparse SVM is shown, while FIG. 13b shows the result of a classic SVM. In this problem, the parse-SVMobtains a sparser and a better solution in terms of generalizationerror. At a first sight, this can be interpreted as a consequence ofexisting theoretical results about sparsity that say that sparsity isgood for generalization. However, in this case, sparsity is more relatedto full compression than to computational dependence. “Computationaldependence” here means that the outcome of the algorithm depends only ona small number of points in the learning set. For example, the supportvectors of a SVM are the only points in the learning set that are usedand the optimization procedure would have given the same linear model ifonly these points had been considered. In the present method, theresulting linear model depends on all of the training points since allof these points have been used in the optimization to reach a localminimum of the concave problem at hand. If one point is changed, even ifit does not occur in the final expansion (Equation 38), it may stillhave a role in orienting the optimization procedure to the current localminimum rather than to another one.

To further analyze the difference between a SVM and the sparse-SVM, theformer example is continued but with a smaller value of σ. FIGS. 14 a-brepresent the separation obtained for a SVM and the sparse-SVMrespectively, plotted for α=0.01. The circled points in the plotsindicate learning points with a non-zero α₁. Thus, it can be seen thatthe number of non-zero α_(i)'s is smaller with the sparse-SVM than for aclassical SVM. The resulting separation is not very smooth and themargin tends to be small. This behavior is enhanced when σ=0.001, theresults of which are shown in FIGS. 14 c-d. Then, nearly all of thenon-zero α_(i)'s of the sparse-SVM are from the same class and thesystem has learned to answer with roughly always the same class (thecircles) except in small regions.

The preceding approach can be used in a kernel-based vector quantization(VQ) method to provide a set of codebook vectors. VQ tries to representa set of l data vectorsx₁, . . . , x_(l)εχ  (40)by a reduced number of m codebook vectorsy₁, . . . y_(m)εχ,  (41)such that some measure of distortion is minimized when each x isrepresented by the nearest (in terms of some metric on χ) y.

Often, codebooks are constructed by greedy minimization of thedistortion measure, an example being the overall l₂ error.

$\begin{matrix}{{E_{VQ} = {\sum\limits_{i = 1}^{l}{{x_{i} - {y\left( x_{i} \right)}}}^{2}}},{where}} & (42) \\{{y\left( x_{i} \right)} = {\arg\mspace{14mu}{\min\limits_{y_{j}}{{{x_{i}y_{j}}}^{2}.}}}} & (43)\end{matrix}$In practice, one specifies m, initializes y, and minimizes a distortionmeasure such as Equation 42. Given receiver knowledge of the codebook,each x can then be compressed into log₂ m bits.

Now, consider the case where it is desired to guarantee (for thetraining set) a maximum level of distortion for any encoded x andautomatically find an appropriate value for m. A restriction that mustbe imposed is that the codebook vectors must be a subset of the data.Finding the minimal such subset that can represent the data with a givenlevel of distortion is a combinatorially hard problem, however aneffective sparse codebook can be obtained using the above-describedtechniques.

VQ algorithms using multiplicative updates can be obtained according tothe following:

Given the data of Equation 40, χ is some space endowed with a metric d.For purposes of this discussion, the term “kernel” will be used in abroader sense to mean functionsk:χ×χ→

  (44)not necessarily satisfying the conditions of Mercer's theorem.

Of particular interest is the kernel that indicates whether two pointslie within a distance of R≧0,k(x,x′)=1_({(x,x′)εχ×χ:d(x,x′)≦) R}.  (45)

Now considering what is sometimes called the “empirical kernel map”,φ_(l)(x)=(k(x ₁ , x), . . . , k(x _(l) , x)),  (46)

a vector w ε

^(l) can be found such thatw ^(T)φ_(l)(x _(i))>0  (47)holds true for all i=1, . . . , l. Then each point x_(i) lies within adistance R of some point x_(j) which has a positive weight w_(j>)0,i.e., the points with positive weights form a R-cover of the whole set.To see this, note that otherwise all nonzero components of w would bemultiplied by a component φ_(l) which is 0, and the above dot productwould equal 0.

Therefore, up to an accuracy of R (measured in the metric d), the x_(j)with nonzero coefficients w_(j) can be considered an approximation ofthe complete training set. (Formally, they define a R-cover of thetraining set in the metric d.) In order to keep the number of nonzerocoefficients small, the following optimization problem is considered:

-   -   for some q≧0, compute:

$\begin{matrix}{\min\limits_{w \in R_{l}}{w}_{q}} & (48)\end{matrix}$

-   -   -   subject to:            w ^(T)φ_(l)(x _(i))≧1.  (49)            In this formulation, the constraint of Equation 47 has been            slightly modified. To understand this, note that if a vector            w satisfies Equation 47, it can be rescaled to satisfy            Equation 49. For the optimization, however, Equation 49 must            be used to ensure that the target function of Equation 48            cannot be trivially minimized by shrinking w.

In vector quantization, the goal is to find a small codebook of vectorsto approximate the complete training set. Ideally, q=0 would be used, inwhich case ∥w∥_(q) would simply count the number of nonzero coefficientsof w.

In practice, q=1 or q=2 is used, since both lead to nice optimizationproblems. Using this elementary optimization procedure, themultiplicative updated rule is applied.

While q=1 already leads to fairly sparse expansions without themultiplicative rule, q=2 does not. However, with the multiplicativerule, both values of q performed fairly well. At the end of theoptimization, the nonzero w_(i) corresponds to a sufficient set ofcodebook vectors y_(i). Occasionally, duplicate codebook vectors areobtained. This occurs if the R-balls around two vectors containidentical subsets of the training set. To avoid this, a pruning step isperformed to remove some of the remaining redundant vectors in thecodebook, by sequentially removing any codebook vector that does notexclusively explain any data vector. Equivalently, the pruning step canbe applied to the training set before training. If there are two pointswhose R-balls cover the same subsets of the training set, they can beconsidered equivalent, and one of them can be removed. Typically, thepruning results in the removal of a further 1-5% of the codebook.

Although the present approach is much faster than a combinatorialapproach, it will still suffer from computational problems if thedataset is too large. In that case, chunking or decomposition methodscan be employed. Such methods include (1) where the inner optimizationuses the l₂ -norm, it has a structure similar to the standard SVMoptimization problem. In this case, SVM chunking methods can be used.(2) Forms of chunking can be derived directly from the problemstructure. These include greedy chunking, SV-style chunking, andparallel chunking.

Greedy chunking involves two steps. In Step 1, start with a first chunkof 0<l₀<l datapoints. Solve the problem for these points, obtaining aset of m≦l₀ codebook vectors. Next, go through the remaining l-l₀ pointsand discard all points that have already been covered. In the nth step,provided there are still points left, take a new chunk of l₀ from theremaining set, find the codebook vectors, and removed all points whichare already covered.

The set of codebook vectors chosen in this manner forms a R-cover of thedataset. In the extreme case where l₀=1, this reduces to a greedycovering algorithm. In the other extreme, where l₀=l, it reduces to theoriginal algorithm.

SV style chunking has the following inner loop:

Step 1: Start with the first chunk of 0<l₀<l datapoints. Solve theproblem for these points, obtaining a set of m≦l₀ codebook vectors anddiscarding the rest from the current chunk. Next, start going throughthe remaining l-l₀ points and fill up the current chunk until is hassize l₀ again. In Step n, provided there are still points left and thecurrent chunk has size smaller than l₀, proceed as above and fill it up.If it is already full, remove a fixed number of codebook vectors alongwith all points falling into their respective r-balls.

Once this is finished, go back and check to see whether all points ofthe training set (apart from those which were removed along with theircodebook vectors) are r-covered by the final chunk. Any points which donot satisfy this are added to the chunk again, using the above techniquefor limiting the chunk size.

Note that there is not guarantee that all points will be covered afterthe first loop. Therefore, in the SV-style chunking algorithm, the innerloop must be repeated until everything is covered.

In parallel chunking, the dataset is split into p parts of equal size.The standard algorithm is run on each part using a kernel of width R/2.Once this is done, the union of all codebooks is formed and a cover iscomputed using width R/2. By the triangular inequality, a R-cover of thewhole dataset is obtained.

There is no restriction on adding points to the training set. If thereis a way to select points which will likely by good codebook vectors,then the algorithm will automatically pick them in its task of finding asparse solution. Useful heuristics for finding such candidate pointsinclude a scheme where, for each codebook vector, a new point isgenerated by moving the codebook vector in the direction of a pointhaving the property that, among all points that are coded only by thepresent codebook vector, has the maximum distance to that codebookvector. If the distance of the movement is such that no other pointsleave the cover, the original codebook vector can be discarded in favorof the new one.

The use of the multiplicative rule avoids a somewhat heuristicmodification of the objective function (or of the kernel) that theoriginal approach required.

EXAMPLE 5 Vector Quantization

As a simple illustrative example, FIG. X+4 shows quantizations obtainedusing the multiplicative kernel VQ algorithm (mk-VQ) of two-dimensionaldata uniformly distributed in the unit square, for values of maximumdistortion R=0.1, 0.2, 0.3 and 0.4, and d being the Euclidean l₂distance. In all cases, the proposed algorithm finds solution which areeither optimal or close to optimal. Optimality is assessed using thefollowing greedy covering algorithm. At each step, find a point with theproperty that the number of points contained in a R-ball around it ismaximal. Add this point to the codebook, remove all points in the ballfrom the training set, and, if there are still points left, go back tothe beginning. In FIGS. 15 a-d, for l=300, the m quantizing vectors,found automatically, are shown circled. Their numbers are m=37, 11, 6,4, which are shown in FIGS. 15 a-d, respectively. Using the greedycovering algorithm, guaranteed lower bounds on the minimal number ofcodebook vectors were found. The bound values are 23, 8, 4, 3. Thecircles of d(x,y)=R are shown for each codebook vector.

The VQ method described herein allows an upper bound to be specified onthe distortion incurred for each coded vector. The algorithmautomatically determines small codebooks covering the entire dataset. Itworks on general domains endowed with a metric, and could this equallywell be used to compute coverings of function spaces. The algorithmcould be used for data reduction as a pre-processing step, e.g., forclassification, by separating the codebook vectors with a margin largerthan R. Target values can be incorporated in the constraint such that itis preferable for the algorithm to find “pure clusters”. It should benoted that although the term vector quantization has been used, themethod can be applied to non-vectorial data, as long as it is possibleto define a suitable distance measure d. VQ methods may also be used tocompress or encode images.

Use of l₀-norm for Multi-label Problems.

The preceding method for feature selection can be used in cases ofmulti-label problems such as frequently arise in bioinformatics. Thesame multiplicative update rule is used with the additional step ofcomputing the label sets size s(x) using ranking techniques. In order tocompute the ranking, a cost function and margin for multi-label modelsare defined.

Cost functions for multi-label problems are defined as follows:

Denote by X an input space. An output space is considered as the spaceformed by all the sets of integer between 1 and Q identified as thelabels of the leaning problem. Such an output space contains 2Q elementsand one output corresponds to one set of labels. The goal is to findfrom a learning set S={(x₁,Y₁), . . . , (x_(m), Y₁)}⊂(

×

)^(m) drawn identically and independently (i.i.d.) from an unknowndistribution D, a function ƒ such that the following generalizationerror is as low as possible:R(f)=E _((x,Y)≈D) [c(f, x, Y)]  (50)

The function c is a real-valued loss and can take different formsdepending on how it is viewed. Here, two types of loss are considered.The first is the “Hamming Loss”, which is defined as

$\begin{matrix}{{{HL}\left( {f,x,Y} \right)} = {\frac{1}{Q}{{{f(x)}{\Delta Y}}}}} & (51)\end{matrix}$where Δ stands for the symmetric difference of sets. Note that the moreƒ(x) is different from Y, the higher the error. Missing one label in Yis less important than missing two, which seems quite natural in manyreal situations. As an example, consider that the labels are possiblediseases related to different genes. Many genes can share the samedisease and lead to many diseases at the same time. Predicting only onedisease although there are three is worse than predicting only two.Having a Hamming Loss of 0.1 means that the expected number of times apair (x,y_(k)) has been misclassified is 0.1. Note that if the HammingLoss is scaled by Q, it is equal to the average number of wrong labelsfor each point.

Any function from

to

can be evaluated by the Hamming loss. In some cases, another type ofloss is considered. Let (r₁(x), . . . , r_(Q)(x)) where r_(k) :

→

, k=1, . . . , Q, be real numbers which are large if k is a label of xand low otherwise. Assume a perfect predictor s(x) of the size of thelabel sets of x. The output ƒ(x) is computed by taking the indices ofthe first largest s(x) elements in (r₁(x), . . . , r_(Q)(x)). Such afunction ƒ is referred as “ranking based”. Denote by Ŷ the complementaryset of Y in {1, . . . , Q}. The Ranking Loss is define as:

$\begin{matrix}{{{RL}\left( {f,x,Y} \right)} = {\frac{1}{{Y}{\hat{Y}}}{{\left( {i,j} \right) \in {{Y \times \hat{Y}\mspace{20mu}{s.t.\mspace{14mu}{r_{i}(x)}}} \leq {r_{j}(x)}}}}}} & (52)\end{matrix}$When s(x) is not given, it must be learned from the data. To assess thequality of different predictors of the size of the label sets, the SizeLoss is defined as:

$\begin{matrix}{{{SL}\left( {s,x,Y} \right)} = {\frac{1}{Q}{{{Y} - {s(x)}}}}} & (53)\end{matrix}$Note that a failure in prediction of the correct size Y is weighteddifferently depending on how different the prediction is from the truevalue. This is motivated by the same reasoning as for the Hamming Loss.

In the following, only linear models are used to perform the ranking andto learn the size of the label sets. Note, however, that the methodsdescribed herein are not intended to be limited to linear models, andthat the methods may be transformed in non-linear methods by using the“kernel trick” as is known in the art. For purposes of furtherdiscussion, it will be assumed that

is a finite dimensional real space that can be thought of as

^(d) where d is the number of features of the inputs. Given Q vectorsw₁, . . . , w_(Q) and Q bias b₁, . . ., b_(Q), there are two ways ofcomputing the output—the binary approach and the ranking approach.

With the binary approach:ƒ(x)=sign(<w ₁ x>+b ₁ , . . . , <w _(Q) ,x>+b _(Q))  (54)where the sign function applies component-wise. The value of ƒ(x) is abinary vector from which the set of labels can be retrieved easily bystating that label k is in the set (<w_(k),x>+b_(k))≧0. This way ofcomputing the output is discussed below. The natural cost function forsuch a model is the Hamming Loss.

With the ranking approach, assume that s(x), the size of the label setfor the input x, is known, and define:r _(k)(x)=<w _(k,) x)+b _(k)  (55)and consider that a label k is in the label set of x iff r_(k)(x) isamong the first s(x) elements (r₁(x), . . . , r_(Q)(x)).

For both systems, the empirical error measured by the appropriate costfunction must be minimized while at the same time controlling thecomplexity of the resulting model. This reasoning is based on theprinciple that having a large margin plus a regularization method leadsto a good system.

First looking at the binary approach, it is assumed that the function ƒis computed as in Equation 54. This means that the decision boundariesare defined by the hyperplanes (w_(k), x)+b_(k=)0. By analogy with thetwo-class case, the margin of ƒ on (x, Y) is defined as the signeddistance between (<w₁, x>)+b₁, . . . , <w_(Q), x>+b_(Q)) and thedecision boundary. It is equal to:

$\begin{matrix}{\min\limits_{k}{y_{k}\frac{\left\langle {w_{k},x} \right\rangle + b_{k}}{w_{k}}}} & (56)\end{matrix}$where y_(k) is a binary element equal to +1 if label k is in Y, and −1otherwise. For a learning set S, the margin can also be defined as:

$\begin{matrix}{\min\limits_{{({x,Y})} \in S}{{yk}\frac{\left\langle {w_{k},x} \right\rangle + b_{k}}{w_{k}}}} & (57)\end{matrix}$

Assuming that the Hamming Loss on the learning set S is zero, the largemargin paradigm we follow consists in maximizing the margin. Bynormalizing the parameters (w_(k), b_(k)) such that:∀(x, Y)εS, yk(<w_(k) , x)+b _(k))≧1  (58)with an equality for at least one x, the margin on S={(x_(i),Y_(i))}_(i=1, . . . Q) is equal to min_(k)∥w_(k)∥⁻¹. Here, Y_(i) isidentified with its binary representation: (y_(i1), . . . , y_(iQ))ε{−1, +1}^(Q). Maximizing the margin or minimizing its inverse yields tothe following problem:

$\begin{matrix}{\min\limits_{w_{k}b_{k}}\mspace{14mu}{\max\limits_{k}{w_{k}}^{2}}} & (59)\end{matrix}$

-   -   subject to:        y _(ik)(<w _(k) ,x _(i) >+b _(k))≧1, for i=1, . . . , m

Up to this point, it has assumed that the Hamming Loss is zero, which isunlikely to be the case in general. The Hamming Loss can be computed as:

$\begin{matrix}{{{HL}\left( {f,x,Y} \right)} = {\frac{1}{Q}{\sum\limits_{k = 1}^{Q}{\theta\left( {- {y_{ik}\left( {\left\langle {w_{k},x_{i}} \right\rangle + b_{k}} \right)}} \right)}}}} & (60)\end{matrix}$where θ(t)=0 for t≦0 and θ(t)=1 for t≧0. The previous problem can begeneralized by combining the minimization of the Hamming Loss and themaximization of the margin:

$\begin{matrix}{{\min\limits_{w_{k}b_{k}}\left( {\max\limits_{k}{w_{k}}^{2}} \right)} + {C{\sum\limits_{i = 1}^{m}{\frac{1}{Q}{\sum\limits_{k = 1}^{Q}{\theta\left( {{- 1} + \xi_{ik}} \right)}}}}}} & (61)\end{matrix}$subject to:y _(ik)((W _(k) , x _(i))+b _(k))≧1−ξ_(ik), for i=1, . . , m ξ _(ik≧)0

This problem is non-convex and difficult to solve. (It inherits from theNP-Hardness of the related two-class problems which is a classical SVMwith threshold margin.) A simpler approach is to upper bound the θ(−1+t)function by the linear problem, which yields:

$\begin{matrix}{{\min\limits_{w_{k}b_{k}}\left( {\max\limits_{k}{w_{k}}^{2}} \right)} + {\frac{C}{Q}{\sum\limits_{i = 1}^{m}{\sum\limits_{k = 1}^{Q}\xi_{ik}}}}} & (62)\end{matrix}$subject to:y _(ik)((W _(k) , x _(i))+b _(k))≧1−ξ_(ik), for i=1, . . , m ξ_(ik≧)0

This problem is a convex problem with linear constraints and can besolved by using a gradient descent or other classical optimizationmethod. Note that when the constant C is infinite, which corresponds tothe case where the minimization of the Hamming Loss is actually the mainobjective function, then the system is completely decoupled and theoptimization can be done on each parameter (w_(k), b_(k)) independently.For finite C, the optimization cannot be done so easily, unless themaximum over the w_(k)×s is approximated by:

$\begin{matrix}{{\max\limits_{k}{w_{k}}^{2}} \leq {\sum\limits_{k = 1}^{Q}{w_{k}}^{2}} \leq {Q\mspace{11mu}{\max\limits_{k}{w_{k}}^{2}}}} & (63)\end{matrix}$

In this case, the problem becomes completely separated in the sense thatoptimizing with respect to w_(k) does not influence the optimizationwith respect to w₁ for l≠k. Note that the optimization procedure is thesame as the one-against-all approach developed in the multi-classsetting.

In Equation 62, the Hamming Loss is replaced by its linear approximation(here computed on (x, Y)):

$\begin{matrix}{{{AHL}\left( {\left( {w_{k},b_{k}} \right)_{{k = 1},\ldots,Q},x,Y} \right)}\frac{1}{Q}{\sum\limits_{k = 1}^{Q}{{{- 1} + {y_{k}\left( {\left\langle {w_{k},x} \right\rangle + b_{k}} \right)}}}_{+}}} & (64)\end{matrix}$where |.|+ is a function from

to

₊ that is the identity on

₊ and that equals zero on

⁻. This linear system is then designed to minimize the AHL functionrather than the Hamming distance between the binary representation ofthe label sets and the function ƒ defined in Equation 53. Duringtraining, it tends to place the output ƒ(x) close to the targets Y interms of the distance derived from the AHL. When a new input x is given,the output of ƒ should then be computed via:

$\begin{matrix}{{f(x)} = {\arg\mspace{14mu}{\min_{Y \in y}{\sum\limits_{k = 1}^{Q}{{{- 1} + {y_{k}\left( {\left\langle {w_{k},x} \right\rangle + b_{k}} \right)}}}_{+}}}}} & (65)\end{matrix}$where

contains all label sets that are potential outputs. When all label setsare acceptable outputs (

={−1, +1}^(Q), Equation 65 is rewritten as Equation 54. In some caseswhere all label sets are not allowed (

{−1, +1}^(Q)), both computations are different (see FIG. 16) and ƒ(x)should be calculated as above rather than using Equation 54. Such casesarise when, e.g., a Error Correcting Output Code (ECOC) is used to solvemulti-class problems. T. G. Dietterich and G. Bakiri in “Solvingmulticlass learning problems via error-correcting output codes”, Journalof Artificial Intelligence Research, 2:263-286, 1995, incorporatedherein by reference, present a method based on error correcting codewhich consists in transforming a multi-class problem into a multi-labelone and in using the fact that not all label sets are allowed. When thelearning system outputs an impossible label set, the choice of thecorrecting code makes it possible to find a potentially correct labelset whose Hamming distance is the closest to the output. If the systemderived from Equation 62 is used with the ECOC approach, consideringthat the Hamming Loss is not the quantity that is minimized, thecomputation of the output should be done by minimizing the Hammingdistance via Equation 65, which can be rewritten as:

$\begin{matrix}{{f(x)} = {\arg\mspace{14mu}{\min_{Y \in y}{\sum\limits_{k = 1}^{Q}{{y_{k} - {\sigma\left( {\left\langle {w_{k},x} \right\rangle + b_{k}} \right)}}}}}}} & (66)\end{matrix}$where σ is the linear function thresholded at −1 (resp. +1) with value−1 (resp. +1).

Referring to FIG. 16, assume there are 5 labels and there are only twopossible label sets: one denoted by Y₁ with binary representation (1, .. . , 1) and the other one Y₂=(−1, . . . , −1). The real values (w_(k),x)+−b_(k) are plotted for k=1, . . . , 5, shown as black spots. TheHamming distance between the output and Y₁ is 4, although it is 1 forY₂. Recall that the Hamming distance is computed on the signed vector ofthe output. The AM between the output and Y₁ is 4.2 although it is 5.8for Y₂. If the final output is computed with the Hamming distance, thenY₂ is chosen. If it is computed via the AHL, then Y₁ is chosen.

Choosing the Hamming Loss leads naturally to a learning system that isvery close to the one obtained from the binary approach. This suffersfrom a certain inability to separate points when simple configurationsoccur, e.g., when there are few labels and some points fall in multipleclasses. This can be addressed by using the ranking approach.

The ranking approach can be broken down into two parts, the first ofwhich is to optimize the Ranking Loss. The second part is obtained byminimizing the Size Loss 1/Q||Y|−s(x)|. This second part is actuallyalmost a regression problem and can be solved with the approaches knownin the prior art.

The goal is to define a linear model that minimizes the Ranking Losswhile having a low complexity. The notion of complexity is the margin.For systems that rank the values of <w_(k), x>+b_(k), the decisionboundaries for x are defined by the hyperplanes whose equations are<w_(k)−w_(l), x>+b_(k)−b_(l)=0, where k belongs to the label sets of xwhile l does not. So, the margin of (x, Y) can be expressed as:

$\begin{matrix}{\min\limits_{{k \in Y},{l \in \overset{\_}{Y}}}\frac{\left\langle {{w_{k} - w_{k}},x} \right\rangle + b_{k} - b_{l}}{{w_{k} - w_{l}}}} & (67)\end{matrix}$

Considering that all the data in the learning set S are well ranked, theparameters w_(k)'s can be normalized such that:<w _(k) −w _(l) ,x<+b _(k) −b _(l)≧1  (68)with equality for some x part of S, and (k,l)εY×Ŷ. Maximizing the marginon the whole learning set can then be done via:

$\begin{matrix}{{\max\limits_{w_{j},{j = 1},\ldots,Q}{\min\limits_{{{({x,Y})} \in {Sk} \in Y},{l \in \overset{\_}{Y}}}{\frac{1}{{{w_{k} - w_{l}}}^{2}}\mspace{14mu}{subject}\mspace{14mu}{to}\text{:}}}}{{{\left\langle {{w_{k} - w_{l}},x_{i}} \right\rangle + b_{k} - b_{l}} \geq 1},{\left( {k,l} \right) \in {Y_{i} \times {\hat{Y}}_{i}}}}} & (69)\end{matrix}$In the case where the problem is not ill-conditioned (two labels arealways co-occurring), the objective function can be replaced by:

$\max\limits_{w_{j}}\mspace{14mu}{\min\limits_{k.l}\frac{1}{{{w_{k} - w_{l}}}^{2}}}$previous problem can then be recast as:

$\begin{matrix}{\min\limits_{w_{j},{j = 1},\ldots,Q}\;{\max\limits_{k,l}{{w_{k} - w_{k}}}^{2}}} & (70)\end{matrix}$

subject to:<w _(k) −w _(l) ,x _(i) >+b _(k) −b _(l)≧1, (k,l)εY _(i) ×Ŷ _(i)

To provide a simpler optimization procedure, this maximum isapproximated by the sum, which leads to the following problem (in thecase the leaming set can be ranked exactly):

$\begin{matrix}{\min\limits_{w_{j},{j = 1},\ldots,Q}{\sum\limits_{k,{l = 1}}^{Q}{{w_{k} - w_{l}}}^{2}}} & (71)\end{matrix}$

subject to:<w _(k) −w _(l) ,x _(i) >+b _(k) −b _(l)≧1, (k,l)εY _(i) ×Ŷ _(i)

Note that a shift in the parameters w_(k) does not change the ranking.Thus, it can be required that

${{\sum\limits_{k = 1}^{Q}\; w_{k}} = 0},$adding this constraint. In this case, Equation 71 is equivalent to:

$\begin{matrix}{\min\limits_{w_{j},{j = 1},\mspace{11mu}\ldots\mspace{11mu},Q}{\sum\limits_{k}^{Q}\;{w_{k}}^{2}}} & (72)\end{matrix}$

subject to:<w _(k) −w _(l) ,x _(i) >+b _(k) −b _(l)≧1, (k,l)εY _(i) ×Ŷ _(i)

To generalize this problem in the case where the learning set cannot beranked exactly, the same reasoning is followed as for the binary case:the ultimate goal would be to minimize the margin and simultaneouslyminimize the Ranking Loss. The latter can be expressed directly byextending the constraints of the previous problems. If <w_(k)−w_(l),x_(i)>+b_(k)−b_(l>)1−ξ_(ikl) for (k,l)εY_(i)×Ŷ_(i), then the RankingLoss on the learning set S is:

$\begin{matrix}{\sum\limits_{i = 1}^{m}\;{\frac{1}{{Y_{i}{{\hat{Y}}_{i}}}}{\sum\limits_{k,{l \in {({Y_{l} \times \;{\hat{Y}}_{i}})}}}{\theta\left( {{- 1} + \xi_{k\; l}} \right)}}}} & (73)\end{matrix}$

Once again, the functions θ(−1+ξ_(ikt)) can be approximated by onlyξ_(ikt), which gives the fi quadratic optimization problem:

$\begin{matrix}{{\min\limits_{w_{j},{j = 1},\mspace{11mu}\ldots\mspace{11mu},Q}{\sum\limits_{k}^{Q}\;{w_{k}}^{2}}} + {C{\sum\limits_{i = 1}^{m}\;{\frac{1}{{Y_{i}{{\hat{Y}}_{i}}}}{\sum\limits_{{{({k,l})} \in {Y_{l} \times \;\hat{Y}}},}\xi_{i\; k\; l}}}}}} & (74)\end{matrix}$

subject to:<w _(k) −w _(l) ,x _(i) >+b _(k) −b _(l)≧1−ξ_(ikl), (k,l)εY_(i) ×Ŷ _(i)ξ_(iki)≧0

In the case where the label sets Y_(i) have all a size of 1, theoptimization problem is the same as has been derived for multi-classSupport Vector Machines. For this reason, the solution of this problemis referred to as a multi-label ranking Support Vector Machine MLR-SVM).

“Categorical regression” can be thought of as the problem of minimizingthe loss:

$\begin{matrix}{{C\;{R\left( {s,x,y} \right)}} = {\frac{1}{Q}{{{s(x)} - y}}}} & (75)\end{matrix}$where y is the label of x (here there is only one label for a x). Theterm “regression” derives from the fact that this cost function is a l₁cost function as in regression while “categorical” comes from the factthat s(x) is discrete. Such a setting arises when the mistakes areordered. For example, one could imagine a situation where predictinglabel 1 although the true label is 4 is worse than predicting label 3.Consider the task of predicting the quality of a web page from labelsprovided by individuals. The individuals rate the web page as “verybad”, “bad”, “neutral”, “good” or “very good”, and the point is not onlyto have few mistakes, but to be able to minimize the difference betweenthe prediction and the true opinion. A system that outputs “very bad”although the right answer is “very good” is worse than a system givingthe “good” answer. Conversely, a “very good” output although it is “verybad” is worse than “bad”. Such a categorical regression is in a wayrelated to ordinal regression, the difference being no order on thetargets is assumed; only order on the mistakes.

Categorical regression (CR) can be dealt with using the multi-labelapproach. This approach provides a natural solution of the categoricalregression problem when the right setting is used. Rather than codingthe labels as integers, encode them as binary vector of {+1, −1}components:

$\begin{matrix}{{{label}\mspace{14mu} i} = \left( {{\underset{\underset{i\mspace{14mu}{times}}{︸}}{1,\;\ldots\mspace{11mu},1} - 1},\;\ldots\mspace{11mu},{- 1}} \right)} & (76)\end{matrix}$With this encoding, the loss CR defmed previously can be expressed interms of the Hamming Loss:CR(s,x,y)=HL({tilde over (s)},x,Y)where {tilde over (s)} is the function s when its output is encoded asin Equation 76, and Y is the encoded label corresponding to y. Theminimization of the Hamming Loss leads naturally to the binary approachfor the multi-label problem. Thus, the same system as the one defined inthe binary approach section will be used. In other words, a linearsystem composed of many two-class sub-systems that learn how torecognize when the components of the label associated to x is +1 or −1will be used. All possible label sets are not allowed here. As for theECOC approach discussed previously, the function s is thus computed viaEquation 66 where

contains labels 1, . . . , Q encoded as Equation 76.

To sum up briefly, the multi-label model is decomposed into two parts,both based on dot products and linear models. The first part ranks thelabels and is obtained via Equation 74. The result is a set ofparameters (w_(k) ¹,b_(k) ¹)_(k=i, . . . , Q). The ranking is done onthe outputs r_(k)(x)=<w_(k) ¹,x>+b_(k) for k=1, . . . , Q.

The second part of the model predicts the size of each label sets and isobtained through Equation 62, where the maximum has been replaced by asum. The result is also a set of parameters (w_(k) ², b_(k)²)_(k=1, . . . , Q). The output is computed by talking the integer whoserepresentation (Equation 76) is the closest from the vector:(σ(<w ₁ ² ,x>+b ₁ ²), . . . , σ(<w _(Q) ² ,x>+b _(Q) ²))  (77)in terms of the l₁, distance. The function a is the linear functionthresholded at −1 (resp. +1) with value −1 (resp. +1). Then, the outputs(x) of this second model is used to compute the number of labels onehas to consider in the ranking obtained by the first model.

Solving a constrained quadratic problem requires an amount of memorythat is quadratic in terms of the learning set size. It is generallysolved in O(m³) computational steps where the number of labels have beenput into the O. Such a complexity is too high to apply these methods inmany real datasets. To avoid this limitation, Franke and Wolfe'slinearization method (described above regarding the AL0M embodiment) isused in conjunction with a predictor-corrector logarithmic barrierprocedure.

To solve Equation 74, the dual variables α_(ikl), ≧0 are related to theconstraints are introduced:<w _(k) −w _(l) ,x _(i) >+b _(k) −b _(l)−1+ξ_(ikl)≧0  (78)and the variables η_(ikl)≧0 related to the constraints ξ_(ikl)≧0. Thelagrangian can then be computed:

$\begin{matrix}{L = {{\frac{1}{2}{\sum\limits_{k = 1}^{Q}\;{w_{k}}^{2}}} + {C{\sum\limits_{i = 1}^{m}\;{\frac{1}{{Y_{i}{{\hat{Y}}_{i}}}}{\sum\limits_{{({k,l})} \in {Y_{l} \times \;{\hat{Y}}_{i}}}\xi_{i\; k\; l}}}}} - {\ldots\mspace{11mu}{\sum\limits_{i = 1}^{m}{\sum\limits_{{({k,l})} \in {Y_{l} \times \;{\hat{Y}}_{i}}}{\alpha_{i\; k\; l}{\quad{\left( {\left\langle {{w_{k} - w_{l}},x_{i}} \right\rangle + b_{k} - b_{l} - 1 + \xi_{i\; k\; l}} \right) - {\sum\limits_{i = l}^{m}{\sum\limits_{{({k,l})} \in {Y_{l} \times \;{\hat{Y}}_{i}}}\eta_{i\; k\; l}}}}}}}}}}} & (79)\end{matrix}$Setting ∂_(bk), L=0 at the optimum yields:

$\begin{matrix}\begin{matrix}{{\sum\limits_{i = 1}^{m}{\sum\limits_{{({j,l})} \in {({Y_{l},\;{\overset{\_}{Y}}_{i}})}}{c_{i\; j\; l}\alpha_{i\; j\; l}}}} = 0} \\{with} \\{c_{ijl} = \left\{ \begin{matrix}0 & {{{if}\mspace{14mu} j} \neq {k\mspace{14mu}{and}\mspace{14mu} l} \neq k} \\{- 1} & {{{{if}\mspace{14mu} j} = k}\mspace{11mu}} \\{+ 1} & {{{{if}\mspace{14mu} l} = k}\mspace{11mu}}\end{matrix} \right.}\end{matrix} & (80)\end{matrix}$Note that c_(ijl) depends on k. Note that the index k has been droppedto avoid excessive indices. Setting ∂_(ξikl) L=0 yields:

$\begin{matrix}{\frac{C}{{Y_{l}{{\overset{\_}{Y}}_{i}}}} = {\alpha_{i\; k\; l} + \eta_{i\; k\; l}}} & (81)\end{matrix}$Then, setting ∂_(wk)L=0 yields:

$\begin{matrix}{w_{k} = {\sum\limits_{i = 1}^{m}{\left( \mspace{11mu}{\sum\limits_{{({j,l})} \in {({Y_{l},{\hat{Y}}_{i}})}}{c_{i\; j\; l}\alpha_{i\; j\; l}}} \right)x_{i}}}} & (82)\end{matrix}$where c_(ijl) is related to index k via Equation 80. For the sake ofnotation, define:

$\begin{matrix}\begin{matrix}{\beta_{ki} = {\sum\limits_{{({j,l})} \in {({Y_{l},\;{\overset{\_}{Y}}_{i}})}}{c_{i\; j\; l}\alpha_{i\; j\; l}}}} \\{{{Then}\text{:}\mspace{14mu} w_{k}} = {\sum\limits_{i = 1}^{m}\;{\beta_{k\; i}{x_{i}.}}}}\end{matrix} & (83)\end{matrix}$The dual of Equation 74 can then be expressed. In order to have assimple notation as possible, it will be expressed with both variables,β_(ki) and α_(ikl).

$\begin{matrix}\begin{matrix}{\max - {\frac{1}{2}{\sum\limits_{k = 1}^{Q}\;{\sum\limits_{h,{i = 1}}^{m}\;{\beta_{k\; h}\beta_{ki}\left\langle {x_{h},x_{i}} \right\rangle}}}} + {\sum\limits_{i = 1}^{m}\;{\sum\limits_{{({k,l})} \in {({Y_{l},\;{\overset{\_}{Y}}_{i}})}}\alpha_{i\;{kl}}}}} \\{{{subject}\mspace{14mu}{to}\text{:}\mspace{14mu}\alpha_{i\; k\; l}} \in \overset{\alpha_{i\; k\; l}}{\left\lbrack {0,\frac{C}{C_{i}}} \right\rbrack}} \\{{{\sum\limits_{i = 1}^{m}{\sum\limits_{{({j,l})} \in {({Y_{l},\;{\overset{\_}{Y}}_{i}})}}{c_{i\; j\; l}\alpha_{i\; j\; l}}}} = 0},{{{for}\mspace{14mu} k} = 1},\;\ldots\mspace{11mu},Q}\end{matrix} & (84)\end{matrix}$The first box constraints are derived according to Equation 81, by usingthe fact that η_(ikl)≧0. The solution is a set of variables α_(ikl) fromwhich w_(k), k=1, . . . , Q can be computed via Equation 82. The biasb_(k), k=1, . . . , Q are derived by using the Karush-Kuhn-Tuckerconditions:α_(ikl)(<w _(k) −w _(l) ,x _(i))+b _(k) −b _(l)−1+ξ_(ikl))=0  (85)(C−α _(ikl))ξ_(ikl)=0  (86)For indices (i, k, l) such that α_(ikl)ε (0, C):<w _(k) −w _(l) ,x _(i) >+b _(k) =b _(l)=1

Since w_(k) and w_(l) are already known, this equation can be used tocompute the differences between b_(k) and b_(l). Note that if aprimal-dual method is used to solve the dual 10, the variables b_(k),k=1, . . . , Q are directly derived as the dual of the constraints

${\sum\limits_{i = 1}^{m}{\sum\limits_{{({j,l})} \in {({Y_{l},\;{\overset{\_}{Y}}_{i}})}}{c_{i\; j\; l}\alpha_{i\; j\; l}}}} = 0.$

The dual problem has an advantage compared to its primal counterpart.First, it contains only Q equality constraints and many box constraints.The latter are quite easy to handle in many optimization methods. Themain problem in this case concerns the objective function: it is aquadratic function that cannot be solved directly except by storing itswhole Hessian. In the case when the number of points is large as well asthe number of classes, the number of variables may be too important totake a direct approach. For this reason, an approximation scheme fromFranke and Wolfe is followed.

$\begin{matrix}{\max\limits_{\alpha}\mspace{11mu}{{g(\alpha)}\mspace{20mu}{subject}\mspace{14mu}{to}\text{:}}} & \; \\{{{\left\langle {\alpha_{k},\alpha} \right\rangle = 0},{{{for}\mspace{14mu} k} = 1},\ldots,\; Q}{\alpha_{j} \in \left\lbrack {0,C} \right\rbrack}} & (87)\end{matrix}$where the vectors α_(i) have the same dimensionality as the vector α.The procedure is as follows:

1. start with α⁰=(0, . . . , 0)

2. Assume at iterations p, α^(p) is given. Solve the linear problem:

$\begin{matrix}{\max\limits_{\alpha}\left\langle {{\nabla\;{g\left( \alpha^{p} \right)}},{\alpha\mspace{20mu}{subject}\mspace{14mu}{to}\text{:}}} \right.} & \; \\{{{\left\langle {\alpha_{k},\alpha} \right\rangle = 0},{{{for}\mspace{14mu} k} = 1},\ldots,\; Q}{\alpha_{j} \in \left\lbrack {0,C} \right\rbrack}} & (88)\end{matrix}$Let α* be the optimum.

3. Compute λε[0,1] such that g(α^(p)+λ(α*−α^(p))) is minimum. Let λ* bethis value.

4. Set α^(p+1)=α^(p)+λ*(α*−α^(p)).

5. End the procedure if λ=0 or if α^(p+1)−α^(p) has a norm lower than afixed threshold.

The purpose of this procedure is to transform a difficult quadraticproblem into many simple linear problems. As shown by Franke and Wolfe,such a procedure converges.

To apply it to the problem at hand, the gradient of the dual objectivefunction must be computed. For that purpose, new vectors are introduced:

$\begin{matrix}{v_{k}^{p} = {\sum\limits_{i = 1}^{m}{\beta_{k\; i}^{p}x_{i}}}} & (89)\end{matrix}$where β^(p) are computed from Equation 83 in terms of the α^(p) _(ikl).The latter are the current values of the parameter optimized usingFranke and Wolfe's method. At the optimum, w_(k)=v_(k). At iteration p,the objective function can thus be expressed as:

$\begin{matrix}{{{- \frac{1}{2}}{\sum\limits_{k = 1}^{Q}\;{v_{k}^{p}}^{2}}} + {\sum\limits_{i = 1}^{m}{\sum\limits_{{({k,l})} \in {({Y_{l},\;{\hat{Y}}_{i}})}}\alpha_{i\; k\; l}}}} & (90)\end{matrix}$where the first part of the equation is the I part the second part isthe J part. The J part can be differentiated directly. The I part can bedifferentiated as:

$\begin{matrix}\begin{matrix}{\frac{\partial I}{\partial\alpha_{i\; k\; l}} = {\sum\limits_{j = 1}^{Q}\left\langle {{\nabla_{v_{j}^{p}}I},{{\nabla\alpha_{i\; k\; l}}v_{j}^{p}}} \right\rangle}} \\{\mspace{34mu}{= {\left\langle {v_{k}^{p},x_{i}} \right\rangle - \left\langle {v_{l}^{p},x_{i}} \right\rangle}}}\end{matrix} & (91)\end{matrix}$

The computation of the gradient of the objective function can thus bedone directly as soon as the vectors v_(k) ^(p) are given. The vectorsare expressed in terms of the α^(p) _(ikl) and only dot products betweenthem and the x_(i)'s are needed, which means that this procedure canalso be used for kernels.

Denoting by W(α) the objective function of Equation 83 the finalalgorithm is as follows.

$\begin{matrix}{{{1.\mspace{14mu}{Start}\mspace{14mu}{with}\mspace{14mu}\alpha} = {\left( {0,\;\ldots\mspace{11mu},0} \right) \in R^{s\;\alpha}}},{{{where}\mspace{14mu} s_{\alpha}} = {\sum\limits_{i = 1}^{m}\;{{Y_{i}{{{\hat{Y}}_{i}}.}}}}}} \\{{2.\mspace{14mu}{Set}\text{:}\mspace{14mu} c_{i\; j\; l}^{k}} = \left\{ \begin{matrix}0 & {{{if}\mspace{14mu} j} \neq {k\mspace{14mu}{and}\mspace{14mu} l} \neq k} \\{- 1} & {{{{if}\mspace{14mu} j} = k}\mspace{11mu}} \\{+ 1} & {{{{if}\mspace{14mu} l} = k}\mspace{11mu}}\end{matrix} \right.} \\{{{3.\mspace{14mu}{For}\mspace{14mu} k} = 1},\;\ldots\mspace{11mu},{{Q\mspace{14mu}{and}\mspace{14mu} i} = 1},\;\ldots\mspace{11mu},m,{{{compute}\text{:}\mspace{14mu}\beta_{k\; i}} = {\sum\limits_{{({j,l})} \in {({Y_{l},\;{\hat{Y}}_{i}})}}{c_{i\; j\; l}^{k}\alpha_{i\; j\; l}}}}} \\{{{4.\mspace{14mu}{For}\mspace{14mu} k} = 1},\;\ldots\mspace{11mu},{{Q\mspace{14mu}{and}\mspace{14mu} j} = 1},\;\ldots\mspace{11mu},m,{{compute}\text{:}\mspace{11mu}{\sum\limits_{i = 1}^{m}{\beta_{k\; l}\left\langle {x_{i},x_{j}} \right\rangle}}}} \\{{5.\mspace{14mu}{Set}\mspace{14mu} g_{i\; k\; l}} = {{- \left( {\left\langle {w_{k},x_{i}} \right\rangle - \left\langle {w_{l},x_{i}} \right\rangle} \right)} + 1}} \\{{6.\mspace{14mu}{Solve}\mspace{14mu}{\min\limits_{\alpha^{new}}{\left\langle {g,\alpha^{new}} \right\rangle\mspace{14mu}{with}\mspace{14mu}{the}\mspace{14mu}{constraints}\mspace{14mu}\alpha_{ijl}^{new}}}} \in \left\lbrack {0,\frac{C}{C_{i}}} \right\rbrack} \\{\mspace{31mu}{{{{and}\mspace{14mu}{\sum\limits_{i = 1}^{m}{\sum\limits_{{({j,l})} \in {({Y_{l},\;{\hat{Y}}_{i}})}}{c_{i\; j\; l}^{k}\alpha_{i\; j\; l}}}}} = 0},{{{for}\mspace{14mu} k} = 1},\;{\ldots\mspace{11mu} Q}}}\end{matrix}$

7. Find λε

such that:

-   -   W(α+λα^(new)) be maximum and α+λα^(new) satisfies the previous        constraints.

8. Set α+λα^(new).

9. Unless convergence, go back to Step 3.

To create the non-linear version of this algorithm, just replace the dotproducts (x_(i),x_(j)) by kernels k(x_(i),x_(j)).

Using the preceding steps, the memory cost of the method becomes thenO(mQQ_(max)) where Q_(max) is the maximum number of labels in the Y. Inmany applications, the total number of labels is much larger thanQ_(max). The time cost of each iteration is O(m²Q).

EXAMPLE 6 Toy Problem

A simple toy problem is depicted in FIG. 17. There are three labels, oneof which (label 1) is associated to all inputs. The learning sets wereof size 50. Both the binary approach and the direct approach wereapplied with ranking, as described previously. Each optimization problemwas solved for C=∞). The Hamming Loss for the binary approach was 0.08although for the direct approach it was 0.

To demonstrate the nature of the error that occurs using the binaryapproach, the distribution of missed classes is represented in FIG. 18.The x-axis represents the number of erroneous labels (missed or added)for the learning set of 50 elements; the y-axis represents the number ofpoints out of 50. In the following, the error for a multi-label systemis represented using this charts and the Hamming Loss. Such arepresentation enables assessment of the quality of a multi-labelclassifier with respect to many criterion, namely, the number ofmisclassified points, the number of classes that are badly predicted,etc. Note that the Hamming Loss is proportional to the sum of the heightof the bars scaled by their coordinates.

EXAMPLE 7 Prostate Cancer Dataset

The dataset is formed by 67 examples in a space of 7129 features and of9 labels. The inputs are results of Micro-array experiments on differenttissues coming from patients with different form of Prostate Cancer. Thenine labels are described in Table 8 and represent either the positionof the tissue in the body (peripheral zone, etc . . . ) or the degree ofmalignity of the disease (G3-G4 Cancer). Note that label 4 has only onepoint which implies that a direct approach will have a leave-one-outHamming Loss of at least 0.02 (corresponding to the error when thepoints labelled Stroma is out).

TABLE 8 Label Description No. of points in class 1 Peripheral Zone 9 2Central Zone 3 3 Dysplasia (Cancer precursor stage) 3 4 Stroma 1 5Benign Prostate Hyperplasia 18 6 G3 (cancer) 13 7 G4 (cancer) 27 8 LCM(Laser confocal microscopy) 20 9 2-amplifications (related to PCR) 34

Two labels have a particular meaning. The LCM label is associated withexamples that have been analyzed with a Laser Confocal Microscopytechnique, and the 2-amplifications label refer to examples that havehad two amplifications during the PCR. These two labels are discarded sothat the remaining problem is one of separating different tissue types,thus reducing the number of labels to 7. The label sets have a maximumsize of 2. To assess the quality of the classifier, a leave-one-outestimate of the Hamming Loss was computed for the direct approach aswell as for the binary approach with linear models and a constant C=∞.This produced an error of 0.11 for the direct approach and of 0.09 forthe binary approach. FIGS. 19 a-c shows the distribution of the errorsfor the direct and the binary approach in the leave-one-out estimate ofthe Hamming Loss. FIG. 19 a shows the errors for the direct approach,where the values of the bars are from left to right: 4, 19 and 3. FIG.19 b shows the errors for the binary approach where the values of thebars are from left to right: 20, 10 and 1. FIG. 19 c shows the errorsfor the binary approach when the system is forced to output at least onelabel, where the values of the bars are from left to right: 9, 16 and 2.

The direct approach yields mistakes on 26 points although the binaryapproach makes mistakes on 31 points. In terms of the Hamming Loss, thebinary approach is the best although in terns of the number of correctlyclassified input points, the direct approach wins. As previouslydiscussed, the binary approach is naturally related to the Hamming Lossand it seems quite natural that it minimizes its value. On the otherhand, if one wants to minimize the number of points where there is anerror, the direct approach is better.

In the direct approach, the empty label set output is not possible. Ifthe binary system is forced to provide at least one label on all theinput sample, the Hamming Loss for the binary becomes 0.10, and thenumber of points where there is an error is 27. Thus, both direct andbinary systems are quite close.

Since the goal in this particular application is to discriminate themalign cells from the benign ones, the multi-labelled approach wasapplied for labels 4-7 only, reducing the number of labels to 4 and thenumber of points in the learning set to 56. With the same setting asabove, the leave-one-out estimate of the Hamming Loss was 0.14 for thedirect approach, and 0.14 for the binary approach. The same experimentas above was performed by computing the Hamming Loss of the binarysystem when the latter is forced to give at least one label, producing aHamming Loss of 0.16. In this particular application, the directapproach is better, yielding a system with the same or a lower HammingLoss than the binary approach.

The results of the leave-one-out estimate of the Hamming Loss for theProstate Cancer Database using 4 labels are provided in FIG. 20 a-c.FIG. 20 a is a histogram of the errors for the direct approach, wherethe values of the bars are from left to right: 2, 13 and 1). FIG. 20 bis a histogram of the errors for the binary approach, where the valuesof the bars are from left to right: 12, 8 and 1. FIG. 20 c is ahistogram of the errors for the binary approach when the system isforced to output at least one label, where the values of the bars arefrom left to right: 7, 13 and 1.

The partial conclusion of these experiments is that defining a directapproach does not provide significant improvement over a simple binaryapproach. Given the nature of the problem we have considered, such aconclusion seems quite natural. It is interesting to note that both thedirect and the binary approach give similar results even though thedirect approach has not been defined to minimize the Hamming Loss.

The above methods can be applied for feature selection in a multi-labelproblem in combination with the method for minimization of the l_(o)norm of a linear system that was discussed relative to the previousembodiment. Note that multi-label problems are omnipresent in biologicaland medical applications.

Consider the following multiplicative update rule method:

1. Inputs: Learning set S=((x_(i), Y₁))_(i−1, . . . , m).

2. Start with w₁, b₁, . . . , w_(Q), b_(Q))=(0, . . . , 0).

3. At iteration t, solve the problem:

$\begin{matrix}{\min\limits_{w_{k},b_{k}}{\sum\limits_{k = 1}^{m}\;{w_{k}}^{2}}} & (92)\end{matrix}$

subject to:<w _(k) *w _(k) ^(t) ,x _(i) >−<w _(l) *w _(l) ^(t) ,x _(i) >+b _(k) −b_(l)≧1, for (k,l)εY _(i) ×Ŷ _(i)where w* w^(t) is the component-wise multiplication between w and w^(t).

4. Let w be the solution. Set: w^(t+1)=w*w^(t).

5. Unless w^(i+1)=w^(t), go back to step 2.

6. Output w^(t+1).

This method is an approximation scheme to minimize the number ofnon-zero components of (w₁, . . . , w_(Q)) while keeping the constraintsof Equation 92 satisfied. It is shown to converge in the priordiscussion. If the multi-label problem is actually a ranking problem,then this feature selection method is appropriate.

For real multi-label problems, a step is still missing, which is tocompute the label sets size s(x).

Using the Prostate Cancer Database, assume that the value of the labelset sizes is known. First, consider the entire dataset and compute howmany features can be used to find the partial ranking between labels.Nine features are sufficient to rank the data exactly. Next, use thefeature selection method in conjunction with a learning step afterwardsto see if choosing a subset of features provide an improvement in thegeneralization performance. For this purpose, a leave-one-out procedureis used. For each point in the learning set, remove that point, performthe feature selection and run a MLR-SVM with the selected features,compute the Hamming Loss on the removed point. All SVMs were trainedwith C=∞. The results are summarized in Table 9. The mean of features isgiven with its standard deviation in parenthesis. The number of errorscounts the number of points that have been assigned to a wrong labelset.

TABLE 9 No. of Features Hamming Loss No. of Errors Direct + Direct 9.5(±1) 0.10 11 Direct + Binary 9.5 (±1) 0.11 19

FIG. 21 shows the distribution of the mistakes using the leave-one-outestimate of the Hamming Loss for the Prostate Cancer Database using 4labels with Feature selection. FIG. 21 a is a histogram of the errorsfor the direct approach where the value of one bars is 11. FIG. 21 b isa histogram of the errors for the binary approach, where the values ofthe bars are from left to right: 12 and 7. This comparison may seem alittle biased since the function s(x) that computes the size of thelabel sets is known perfectly. Note, however, that for the binaryapproach, feature selection preprocessing provides a betterleave-one-out Hamming Loss. There are also fewer mislabeled points thanthere are if no feature selection is used.

Although the preceding discussion has covered only linear systems, theinventive method can be applied to more complex problems when onlydot-products are involved. In this case, the kernel trick can be used totransform a linear model into a potentially highly non-linear one. To doso, it suffices to replace all the dot products (x_(i),x_(j)) byk(x_(i),x_(j)) where k is a kernel. Examples of such kernels are thegaussian kernel exp (−∥x_(i)−x_(j)∥²/σ²) or the polynomial kernel(<x_(i),x_(j)>+1)^(d) for d εN\{0}. If these dot products are replaced,it becomes apparent that the other dot products (w_(k), x_(t)) can alsobe computed via Equation 81 and by linearity of the dot product. Thus,the MLR-SVM and the ML-SVM introduced can also be defined for differentdot product computed as kernels.

Feature Selection by Unbalanced Correlation.

A fourth method for feature selection used the unbalanced correlationscore (CORR_(ub)) according to criterion

$\begin{matrix}{f_{i} = {\sum\limits_{i}{X_{i\; j}Y_{i}}}} & (93)\end{matrix}$where the score for feature j is ƒ_(j) and a larger score is assigned toa higher rank. Before this method is described, note that due to thebinary nature of the features, this criterion can also be used to assignrank to a subset of features rather than just a single feature. This canbe done by computing the logical OR of the subset of features. If thisnew vector is designated as x, compute

$\sum\limits_{i}{x_{i}{Y_{i}.}}$A feature subset which has a large score can thus be chosen using agreedy forward selection scheme. This score favors features correlatedwith the positive examples, and thus can provide good results for veryunbalanced datasets.

The dataset used for testing of the CORR_(ub) feature selection methodconcerns the prediction of molecular bioactivity for drug design. Inparticular, the problem is to predict whether a given drug binds to atarget site on thrombin, a key receptor in blood clotting. The datasetwas provided by DuPont Pharmaceuticals.

The data was split into a training and a test set. Each example(observation) has a fixed length vector of 139,351 binary features(variables) in {0,1}. Examples that bind are referred to as having label+1 (and hence being called positive examples). Conversely, negativeexamples (that do not bind) are labeled −1.

In the training set there are 1909 examples, 42 of which are which bind.Hence the data is rather unbalanced in this respect (42 positiveexamples is 2.2% of the data). The test set has 634 examples. The taskis to determine which of the features are critical for binding affinityand to accurately predict the class values using these features.

Performance is evaluated according to a weighted accuracy criterion dueto the unbalanced nature of the number of positive and negativeexamples. That is, the score of an estimate Ŷ of the labels Y is:

$\begin{matrix}\begin{matrix}{{l_{weighted}\left( {Y,\hat{Y}} \right)} = {{\frac{1}{2}\left( \frac{\#\left\{ {{\hat{Y}:Y} = {{1\bigwedge\hat{Y}} = 1}} \right\}}{\#\left\{ {{Y:Y} = 1} \right\}} \right)} +}} \\{\frac{\#\left\{ {{\hat{Y}:Y} = {{{- 1}\bigwedge\hat{Y}} = {- 1}}} \right\}}{\#\left\{ {{Y:Y} = {- 1}} \right\}}}\end{matrix} & (94)\end{matrix}$where complete success is a score of 1. In this report we also multiplythis score by 100 and refer to it as percentage (weighted) success rate.

Also of interest is (but less important) is the unweighted success rate.This is calculated as:

$\begin{matrix}{{l_{unweighted}\left( {Y,\hat{Y}} \right)} = {\sum\limits_{l}{\frac{1}{2}{{{{\hat{Y}}_{i} - Y_{i}}}.}}}} & (95)\end{matrix}$

An important characteristic of the training set is that 593 exampleshave all features equal to zero. 591 of these examples are negative and2 of these are positive (0.3%). In the training set as a whole, 2.2% arepositive, so new examples with all features as zero should be classifiednegative. As not much can be learned from these (identical) examples,all but one were discarded, which was a negative example. The primarymotivation for doing this was to speed up computation during thetraining phase

A summary of some characteristics of the data can be found in Table 10below. Considering the training data as a matrix X of size 1909×139351,the number of nonzero features for each example, and the number ofnonzero values for each feature were computed.

TABLE 10 Type min max mean median std Nonzeroes in example 0 29744 950136 2430 i, s_(i) = Σ_(j)X_(ij) Nonzeroes in feature 1 634 13.01 5 22.43j, f_(j) = Σ_(i)X_(ij)The results show that the data is very sparse. Many features have only afew nonzero values. If there are noisy features of this type, it can bedifficult to differentiate these features from similarly sparse vectorswhich really describe the underlying labeling. This would suggest theuse of algorithms which are rather suspicious of all sparse features (asthe useful ones cannot be differentiated out). The problem, then, iswhether the non-sparse features are discriminative.

The base classifier is a SVM. The current experiments employ only linearfunctions (using the kernel K=X*X′) or polynomial functionsK=(X*X′+1)^(d), where X is the matrix of training data.

For use in unbalanced datasets, methods are used to control trainingerror. To introduce decision rules which allow some training error (SVMsotherwise try to correctly separate all the data) one can modify thekernel matrix by adding a diagonal term. This controls the trainingerror in the following way. An SVM minimizes a regularization term R (aterm which measures the smoothness of the chosen function) and atraining error term multiplied by a constant C:R+C*L. Here,

$L = {\sum\limits_{i}\xi_{i}^{2}}$where ξ_(i)=Y_(i)−Ŷ_(i) (the estimate Ŷ of the label Y for example i isa real value) and so the size of the diagonal term is proportional to1/C. Large values of the diagonal term will result in large trainingerror.

Although there are other methods to control training error, thefollowing method is adopted for use on unbalanced datasets. In thismethod, there is a trade off of errors between negative and positiveexamples in biological datasets achieved by controlling the size of thediagonal term with respect to the rows and columns belonging to negativeand positive examples. The diagonal term is set to be n(+)/n*m forpositive examples and n(−)/n*m for negative examples, where n(+) andn(−) are the number of positive and negative examples respectively,n=n(+)+n(−) and m is the median value of the diagonal elements of thekernel.

After training the SVM and obtaining a classifier control of the falsepositive/false negative trade off, ROCS (receiver operatingcharacteristics) is evaluated for this classifier. This can be done byconsidering the decision function of an SVM which is a signed real value(proportional to the distance of a test point from the decisionboundary). By adding a constant to the real value before talking thesign, a test point is more likely to be classified positive.

A very small number of features (namely, 4) was chosen using the RFE,l₀-norm and CORR_(ub) three feature selection methods to provide acomparison. For CORR_(ub) the top 4 ranked features are chosen usingEquation 93. In all cases, the features chosen were tested via crossvalidation with the SVM classifier. The cross validation was eight-fold,where the splits were random except they were forced to be balanced,i.e., each fold was forced to have the same ratio of positive andnegative examples. (This was necessary because there were only 42positive examples (40, if one ignores the ones with feature vectors thatare all zeros). If one performed normal splits, some of the folds maynot have any positive examples.) The results are given in Table 11below, which shows the cross validation score (cv), the training scoreand the actual score on the test set for each features selection method.The score function is the l_(weighted) of Equation 94, i.e., optimalperformance would correspond to a score of 1.

TABLE 11 cv train test RFE 0.5685 ± 0.1269 0.7381 0.6055 l₀-C 0.6057 ±0.1264 0.7021 0.4887 CORR_(ub) 0.6286 ± 0.1148 0.6905 0.6636Note that the test score is particularly low on the l₀-C method. Recallthat the balance between classifying negative and positive pointscorrectly is controlled by the diagonal term added to the kernel, whichwas fixed a priori. It is possible that by controlling thishyperparameter one could obtain better results, and it may be that thedifferent systems are unfairly reflected, however this requiresretraining with many hyperparameters. Compensation for the lack oftuning was attempted by controlling the threshold of the real valuedfunction after training. In this way one obtains control of the numberof false negatives and false positives so that as long as the classifierhas chosen roughly the correct direction (the correct features andresulting hyperplane), results can improve in terms of the success rate.To verify this method, this parameter was adjusted for each of thealgorithms, then the maximum value of the weighted CV (cross validation)success rate was taken. This is shown in Table 12 with the valuescv_(max), train_(max) and test_(max).

TABLE 12 cm_(max) train_(max) test_(max) RFE 0.6057 ± 0.1258 0.73810.6137 l₀-C 0.6177 ± 0.1193 0.8079 0.5345 CORR_(ub) 0.6752 ± 0.17880.8669 0.7264

The best performance on the test set is 72.62%. The best results weregiven by the unbalanced correlation score (CORR_(ub)) method. The twoother features selection methods appear to overfit. One possible reasonfor this is these methods are not designed to deal with the unbalancednature of the training set. Moreover, the methods may be finding a toocomplex combination of features in order to maximize training score(these methods choose a subset of features rather than choosing featuresindependently). Note also that both methods are in some sense backwardselection methods. There is some evidence to suggest that forwardselection methods or methods which treat features independently may bemore suitable for choosing very small numbers of features. Their failureis entirely plausible in this situation given the very large number offeatures.

The next step in the analysis of the dataset was to select the type ofkernel to be used. Those considered were linear kernels and polynomialsof degrees 2 and 3. The results are given in Tables. 13 and 14.

TABLE 13 cv train test CORR_(ub) linear 0.6286 ± 0.1148 0.6905 0.6636CORR_(ub) poly 2 0.6039 ± 0.1271 0.6905 0.6165 CORR_(ub) poly 3 0.5914 ±0.1211 0.6429 0.5647

TABLE 14 cv_(max) train_(max) test_(max) CORR_(ub) linear 0.6752 ±0.1788 0.8669 0.7264 CORR_(ub) poly 2 0.6153 ± 0.1778 0.8669 0.7264CORR_(ub) poly 3 0.6039 ± 0.1446 0.8201 0.7264

Next, the number of features chosen by the CORR_(ub) method wasadjusted. 20 The greedy algorithm described above was used to add n=1, .. . , 16 more features to the original 4 features that had been selectedindependently using Equation 93. Tests were also run using only 2 or 3features. The results are given in the Tables 15a, b and c.

TABLE 15a features 2 3 4 5 6 7 8 train 0.5833 0.6905 0.6905 0.72620.7500 0.8212 0.8212 test 0.4979 0.6288 0.6636 0.6657 0.6648 0.72470.7221 cv 0.5786 0.6039 0.6286 0.6745 0.7000 0.6995 0.6992 (cv-std)0.0992 0.1271 0.1448 0.2008 0.1876 0.1724 0.1722

TABLE 15b features 9 10 11 12 13 14 15 train 0.8212 0.8212 0.8093 0.80950.8095 0.8095 0.8452 test 0.7449 0.7449 0.7449 0.7336 0.7346 0.73460.7173 cv 0.7341 0.7341 0.7216 0.7218 0.7140 0.7013 0.7013 (cv-std)0.1791 0.1791 0.1773 0.1769 0.1966 0.1933 0.1933

TABLE 15c features 16 17 18 19 20 train 0.8452 0.8452 0.8571 0.85710.8571 test 0.6547 0.6701 0.6619 0.6905 0.6815 cv 0.7013 0.7013 0.70130.7010 0.7007 (cv-std) 0.1933 0.1933 0.1933 0.1934 0.1940

In Tables 16a-c, the same results are provided showing the maximum valueof the weighted success rate with respect to changing the constantfactor added to the real valued output before thresholding the decisionrule (to control the tradeoff between false positives and falsenegatives). The best success rate found on the test set using thismethod is 75.84%. Cross-validation (CV) results indicate that 9 or 1 0features produce the best results, which gives a test score of 74.49%.

TABLE 16a features 2 3 4 5 6 7 8 train_(max) 0.8431 0.8669 0.8669 0.92640.9502 0.8212 0.8212 test_(max) 0.5000 0.6308 0.7264 0.7264 0.72750.7247 0.7250 cv_(max) 0.6624 0.6648 0.6752 0.7569 0.7460 0.7460 0.7489(cv_(max)-std) 0.1945 0.0757 0.1788 0 0.1949 0.1954 0.1349

TABLE 16b features 9 10 11 12 13 14 15 train_(max) 0.8212 0.8212 0.82120.8450 0.8450 0.8450 0.8571 test_(max) 0.7449 0.7449 0.7472 0.75740.7584 0.7584 0.7421 cv_(max) 0.7583 0.7599 0.7724 0.7729 0.7731 0.77340.7713 (cv_(max)-std) 0.1349 0.1349 0.1349 0.1245 0.1128 0.1187 0.1297

TABLE 16c features 16 17 18 19 20 train_(max) 0.8571 0.8571 0.86900.8690 0.8690 test_(max) 0.7454 0.7277 0.7205 0.7289 0.7156 cv_(max)0.7963 0.7838 0.7815 0.7713 0.7669 (cv_(max)-std) 0.1349 0.1297 0.13490.1245 0.1349

Note that the training score does not improve as expected when morecomplicated models are chosen. This is likely the result of two factors:first, the size of the diagonal term on the kernel may not scale thesame resulting in a different value of the cost function for trainingerrors, and second (which is probably the more important reason)training error is only approximately minimized via the cost functionemployed in SVM which may suffer particularly in this case of unbalancedtraining data.

Based on the CV results, linear functions were selected for the rest ofthe experiments.

To provide a comparison with the SVM results, a C4.5 (decision tree)classifier and k-nearest neighbours (k-NN) classifier were each run onthe features identified by cross validation. SVMs gave a test success of0.7449, but standard C4.5 gave a test success of 0.5. The problem isthat the standard version of C4.5 uses the standard classification lossas a learning criterion and not a weighted loss. Therefore, it does notweight the importance of the positive and negative examples equally and,in fact, in the training set classifies all examples negative. “Virtual”training examples were created by doubling some of the points to makethe set seem more balanced, however, this did not help. It appears thatthe C4.5 algorithm internally would need to be modified internally tomake it work on this dataset.

In the k-NN comparison, the decision rule is to assign the class of themajority of the k-nearest neighbors x_(i=1, . . . , l) of a point x tobe classified. However, the distance measure was altered so that ifx_(i) is a positive example, the measure of distance to x was scaled bya parameter λ. By controlling λ, one controls the importance of thepositive class. The value of λ could be found by maximizing over successrates on the training set, but in the present experiments, the maximumperformance on the test set over the possible choices of λ to get anupper bound on the highest attainable success rate of k-NN was observedThe results for this method (k-NN_(max)) and conventional k-NN are givenin Table 17. These results fall short of the previously-described SVMresults (0.7449) with SVM outperforming both variations of k-NN for allvalues of the hyperparameters.

TABLE 17 k 1 2 3 4 5 6 7 8 k-NN 0.57 0.20 0.64 0.63 0.63 0.37 0.50 0.50k-NN_(max) 0.59 0.35 0.65 0.65 0.65 0.41 0.50 0.50

The present embodiment of the feature selection method was comparedagainst another standard: correlation coefficients. This method ranksfeatures according to the score

${\frac{\mu_{( + )} - \mu_{( - )}}{\sigma_{( + )} + \sigma_{( - )}}},$where μ₍₊₎ and μ⁽⁻⁾ are the mean of the feature values for the positiveand negative examples respectively, and σ₍₊₎ and σ⁽⁻⁾ are theirrespective standard deviations.

The results of this comparison are given below in Table 18. Overall,correlation coefficients are a poor feature selector, however, SVMsperform better than either of the other classifiers (k-NN or C4.5) usingfeatures selection based on correlation coefficients.

TABLE 18 features 2 4 6 8 10 12 14 16 18 SVM_(max) 0.62 0.62 0.59 0.590.52 0.52 0.50 0.51 0.50 k-NN_(max) 0.62 0.59 0.53 0.55 0.53 0.51 0.520.51 0.52 C4.5 0.62 0.57 0.55 0.51 0.49 0.52 0.53 0.51 0.51

Also considered was the following feature selection scoring method whichinvolved selecting the features with the highest unbalanced correlationscore without any entry assigned to negative samples. This can beachieved by computing the score

$f_{i} = {{\sum\limits_{Y_{i} = 1}\; X_{ij}} - {\lambda{\sum\limits_{Y_{i} = {- 1}}X_{ij}}}}$for large values of λ. However, in practice, if λ≧3, the results are thesame on this dataset.

The results for the unbalanced correlation score with no negativeentries are shown in Table 19 using an OR function as the classifier.The test success rates are slightly better than those obtained for theprevious correlation score (in which λ=1), but the advantage is thatthere are fewer hyperparameters to adjust. Unfortunately, the crossvalidation results do not accurately reflect test set performance,probably because of the non-iid nature of the problem. Because of itssimplicity, this feature selection score is used in the followingdescription of transductive algorithms.

TABLE 19a Feat. 1 2 3 4 5 6 7 8 9 10 Tr. 66.6 66.6 66.6 66.6 66.6 66.671.4 76.1 76.1 76.1 Ts. 63.0 72.8 74.6 74.9 74.99 75.3 76.9 74.4 74.774.6

TABLE 19b Feat. 11 12 13 14 15 16 17 18 19 20 Tr. 76.1 76.1 76.1 76.176.1 77.3 77.3 77.3 79.7 80.9 Ts. 73.4 73.4 73.4 76.2 76.2 75.7 75.073.8 69.5 65.0

In inductive inference, which has been used thus far in the learningprocess, one is given data from which one builds a general model andthen applies this model to classify new unseen (test) data. There alsoexists another type of inference procedure, so called transductivelearning, in which one takes into account not only given data (thetraining set) but also the (unlabeled) data that one wishes to classify.In the strictest sense, one does not build a general model (fromspecific to general), but classifies the test examples directly usingthe training examples (from specific to specific). As used herein,however, algorithms are considered which take into account both thelabeled training data and some unlabeled data (which also happens to bethe test examples) to build a more accurate (general) model in order toimprove predictions.

Using unlabeled data can become very important with a dataset wherethere are few positive examples, as in the present case. In thefollowing sections, experiments with a simple transduction algorithmusing SVMs are shown to give encouraging results.

A naive algorithm was designed to take into account information in thetest. First, this algorithm used the SVM selected via cross validation.Then, calculated the real valued output y=g(x) of this classifier on thetest set was calculated. (Before taking the sign—this is proportional tothe distance from the hyperplane of a test point). These values werethen used as a measure of confidence of correct labeling of thosepoints.

Next, two thresholds t₁<t₂ were selected to give the setsA={x_(i):g(x_(i))<t₁} and B={x_(i):g(x_(i))>t₂}, where it is assumedthat the points in set A have label −1 and B have label +1. Thethresholds t₁ and t₂ should be chosen according to the conditionalprobability of y given the distance from the hyperplane. In thisexperiment the thresholds were hand-selected based upon the distributionof correctly labeled examples in cross validation experiments.

Feature selection is performed using CORR_(ub) using both the trainingset and the sets A and B to provide a sort of enlarged training set.Then, a SVM is trained on the original training set with the n bestfeatures from the resulting correlation scores. For each n, the bestscores of the SVM were calculated by adjusting the threshold parameter.

This scheme could be repeated iteratively. The results of thetransductive SVM using CORR_(ub) feature selection compared with theinductive version, shown in Tables 20a and b and FIG. 22, were quitepositive. The best success rate for the transductive algorithm is 82.63%for 30-36 features.

TABLE 20a feature 2 4 6 8 10 SVM-TRANS_(max) 0.7115 0.7564 0.7664 0.79790.8071 SVM_(max) 0.7574 0.7264 0.7275 0.7250 0.7449

TABLE 20b feature 12 14 16 18 20 SVM-TRANS_(max) 0.8152 0.8186 0.82060.8186 0.8219 SVM_(max) 0.7574 0.7584 0.7454 0.7205 0.7156

In order to simplify the transduction experiments, a transductiveapproach of the CORR_(ub) method was applied. This should remove some ofthe hyperparameters (e.g the hyperparameters of the SVM) and make thecomputations faster, facilitating computation of several iterations ofthe transduction procedure, and the corresponding cross validationresults.

To apply the transductive approach, a simple probabilistic version ofthe algorithm set forth above was designed:

-   -   1. Add the test examples to the training set but set their        labels y_(i)=0 so their labels are unknown.    -   2. Choose n features using the score for a feature j,

${f_{j} = {{\sum\limits_{Y_{i} > 0}{Y_{i}X_{ij}}} + {3{\sum\limits_{Y_{i} < 0}{Y_{i}X_{ij}}}}}};$

-   -    (the CORR_(ub) method).    -   3. For each test example recalculate their labels to be:        y_(i)=ƒ(g(x_(i))) where g(x_(i)) is the sum of the values of the        features (the features were selected in the previous step). The        function ƒ can be based on an estimate of the conditional        probability of y given g(x_(i)). In this case, the selected        function was ƒ(x)=tan h(sx/n−b) where s and b were estimated to        be the values s=4 and b=0.15. These values were selected by        performing cross validation on different feature set sizes to        give the values g(x_(i)) of left out samples. The values are        then selected by measuring the balanced loss after assigning        labels.    -   4. Repeat from step 2 until the features selected do not change        or a maximum number of iteration have occured.

FIG. 23 shows the results of the transductive CORR_(ub) ² methodcompared to the inductive version for from 4 to 100 features. The bestsuccess rate for the transductive algorithm is 82.86%. Again, thetransductive results provide improvement over the inductive method. Thetransductive results are particularly robust with increasing numbers offeatures, up to one hundred features. After 206 features the resultsstart to diminish, yielding only 77% success. At this same point, theinductive methods give 50% success, i.e., they no longer learn anystructure. For 1000 features using the transductive method one obtains58% and for 10000 features one no longer learns, obtaining 50%. Notealso that for CORR (the inductive version), the cross validation resultsdo not help select the correct model. This may be due to the smallnumber of positive examples and the non-iid nature of the data. As thecross validation results are more or less equal in the transductive caseas the number of features increases, it would be preferable to choose asmaller capacity model with less features, e.g, 10 features. Table 21provides the results of the tranductive CORR_(ub) ² method, while Table22 provides the results of the CORR_(ub) ² inductive method, for testsuccess, cross validation success and the corresponding standarddeviation.

TABLE 21 features 1 5 10 15 20 25 30 35 40 TRANS 0.6308 0.7589 0.81630.8252 0.8252 0.8252 0.8252 0.8252 .08252 test TRANS 0.6625 0.67460.6824 0.6703 0.6687 0.6687 0.6683 0.6683 0.6683 cv TRANS 0.074 0.0890.083 0.088 0.089 0.090 0.089 0.089 0.089 cv-std

TABLE 22 features 1 11 21 31 41 CORR_(ub) ² test 0.6308 0.7315 0.61650.5845 0.5673 CORR_(ub) ² cv 0.5176 0.6185 0.6473 0.6570 0.6677CORR_(ub) ² cv-std 0.013 0.022 0.032 0.04 0.04

In summary, feature selection scores and forward selection schemes whichtake into account the unbalanced nature of the data, as in the presentexample, are less likely to overfit than more complex backward selectionschemes. Transduction is useful in taking into account the differentdistributions of the test sets, helping to select the correct features.

According to the present invention, a number of different methods areprovided for selection of features in order to train a learning machineusing data that best represents the essential information to beextracted from the data set. The inventive methods provide advantagesover prior art feature selection methods by taking into account theinterrelatedness of the data, e.g., multi-label problems. The featuresselection can be performed as a pre-processing step, prior to trainingthe learning machine, and can be done in either input space or featurespace.

It should be understood, of course, that the foregoing relates only topreferred embodiments of the present invention and that numerousmodifications or alterations may be made therein without departing fromthe spirit and the scope of the invention as set forth in the appendedclaims. Such alternate embodiments are considered to be encompassedwithin the spirit and scope of the present invention. Accordingly, thescope of the present invention is described by the appended claims andis supported by the foregoing description.

1. A computer-implemented method for identifying a pattern within alarge dataset, wherein data points in the dataset correspond to aphysical measurement and have a plurality of features that describeattributes of the data point, the method comprising: inputting the largedataset into a computer system having a processor and a memory;selecting a feature subset of the large number of features forprocessing in a learning machine by executing a feature selectionalgorithm on the dataset to identify the subset of features, wherein thealgorithm is selected from the group consisting of l₀-norm minimizationand unbalanced correlation, wherein l₀-norm minimization comprisesfinding a smallest number of non-zero elements of a weight vector w inthe relationship D(x)=w·x+b, and unbalanced correlation comprisesdividing the dataset into unbalanced groups of positive and negativeexamples and ranking features according to a success criterion forcorrectly classifying the dataset; processing the feature subset of thedata points of the large dataset to identify the pattern; and generatingan output to a printer or display device, the output comprising theidentified pattern within the large dataset for the feature subset. 2.The method of claim 1, wherein the learning machine is a support vectormachine.
 3. The method of claim 1, wherein the algorithm is l₀-normminimization and the dataset comprises a multi-label dataset, andfurther comprising calculating a label set size.
 4. The method of claim3, wherein the step of calculating label set size comprises minimizing aranking loss.
 5. The method of claim 3, wherein the dataset comprisesgene expression data obtained from DNA micro-arrays and the outputcomprises identities of a group of genes for detecting a disease orcondition.
 6. The method of claim 1, wherein the algorithm is l₀-normminimization and the algorithm comprises approximately minimizingl₀-norm by minimizing l₁-norm.
 7. The method of claim 1, wherein thealgorithm is l₀-norm minimization and the algorithm comprisesapproximately minimizing l₀-norm by minimizing l₂-norm.
 8. The method ofclaim 1, further comprising mapping the dataset into feature space priorto executing the feature selection algorithm.
 9. The method of claim 1,wherein the algorithm is unbalanced correlation and the datasetcomprises labeled and unlabeled data, the method further comprisingusing transductive learning to classify the unlabeled data prior toexecuting the feature selection algorithm.
 10. The method of claim 1,wherein the dataset comprises gene expression data obtained from DNAmicro-arrays and the output comprises identities of a group of genes fordetecting a disease or condition.
 11. A computer-implemented method foridentifying a pattern within a large dataset, wherein the data points inthe dataset correspond to physical measurements and each data point hasa plurality of features that describe attributes of the data point, themethod comprising: inputting the large dataset into a computer systemhaving a processor and a memory; selecting a feature subset of the largenumber of features for processing in a learning machine by executing afeature selection algorithm on the dataset to identify the subset offeatures, wherein the algorithm comprises l₀-norm minimization, whereinl₀-norm minimization comprises finding a smallest number of non-zeroelements of a weight vector w in the relationship D(x)=w·x+b; processingthe feature subset of the data points of the large dataset to identifythe pattern; and generating an output to a printer or display device,the output comprising the identified pattern within the large datasetfor the feature subset.
 12. The method of claim 11, wherein the learningmachine is a support vector machine.
 13. The method of claim 11, whereinthe dataset comprises a multi-label dataset, and further comprisingcalculating a label set size.
 14. The method of claim 13, wherein thestep of calculating label set size comprises minimizing a ranking loss.15. The method of claim 11, wherein the dataset comprises geneexpression data obtained from DNA micro-arrays and the output comprisesidentities of a group of genes for detecting a disease or condition. 16.The method of claim 11, wherein the algorithm comprises approximatelyminimizing l₀-norm by minimizing l₁-norm.
 17. The method of claim 11,wherein the algorithm comprises approximately minimizing l₀-norm byminimizing l₂-norm.
 18. The method of claim 11, further comprisingmapping the dataset into feature space prior to executing the featureselection algorithm.
 19. The method of claim 10, wherein the largedataset is a prostate cancer database.
 20. The method of claim 15,wherein the large dataset is a prostate cancer database.